{"title": "Probabilistic low-rank matrix completion on finite alphabets", "book": "Advances in Neural Information Processing Systems", "page_first": 1727, "page_last": 1735, "abstract": "The task of reconstructing a matrix given a sample of observed entries is known as the \\emph{matrix completion problem}. Such a consideration arises in a wide variety of problems, including recommender systems, collaborative filtering, dimensionality reduction, image processing, quantum physics or multi-class classification to name a few. Most works have focused on recovering an unknown real-valued low-rank matrix from randomly sub-sampling its entries. Here, we investigate the case where the observations take a finite numbers of values, corresponding for examples to ratings in recommender systems or labels in multi-class classification. We also consider a general sampling scheme (non-necessarily uniform) over the matrix entries. The performance of a nuclear-norm penalized estimator is analyzed theoretically. More precisely, we derive bounds for the Kullback-Leibler divergence between the true and estimated distributions. 
In practice, we have also proposed an efficient algorithm based on lifted coordinate gradient descent in order to tackle potentially high dimensional settings.", "full_text": "Probabilistic low-rank matrix completion on finite alphabets\n\nJean Lafond\nInstitut Mines-Télécom, Télécom ParisTech, CNRS LTCI\njean.lafond@telecom-paristech.fr\n\nOlga Klopp\nCREST et MODAL'X, Université Paris Ouest\nOlga.KLOPP@math.cnrs.fr\n\nÉric Moulines\nInstitut Mines-Télécom, Télécom ParisTech, CNRS LTCI\nmoulines@telecom-paristech.fr\n\nJoseph Salmon\nInstitut Mines-Télécom, Télécom ParisTech, CNRS LTCI\njoseph.salmon@telecom-paristech.fr\n\nAbstract\n\nThe task of reconstructing a matrix given a sample of observed entries is known as the matrix completion problem. It arises in a wide range of problems, including recommender systems, collaborative filtering, dimensionality reduction, image processing, quantum physics and multi-class classification, to name a few. Most works have focused on recovering an unknown real-valued low-rank matrix from randomly sub-sampled entries. Here, we investigate the case where the observations take a finite number of values, corresponding for example to ratings in recommender systems or labels in multi-class classification. We also consider a general sampling scheme (not necessarily uniform) over the matrix entries. The performance of a nuclear-norm penalized estimator is analyzed theoretically. More precisely, we derive bounds on the Kullback-Leibler divergence between the true and estimated distributions. In practice, we have also proposed an efficient algorithm based on lifted coordinate gradient descent in order to tackle potentially high dimensional settings.\n\n1 Introduction\n\nMatrix completion has attracted a lot of contributions over the past decade.
It consists in recovering the entries of a potentially high dimensional matrix from random, partial observations. In the classical noisy matrix completion problem, the entries are assumed to be real valued and observed in the presence of additive (homoscedastic) noise. In this paper, it is assumed that the entries take values in a finite alphabet, which can model categorical data. Such a problem arises in the analysis of voting patterns, recovery of incomplete survey data (typical survey responses are true/false, yes/no or do not know, agree/disagree/indifferent), quantum state tomography [13] (binary outcomes) and recommender systems [18, 2] (for instance in common movie rating datasets, e.g., MovieLens or Netflix, ratings range from 1 to 5), among many others. It is customary in this framework that rows represent individuals while columns represent items, e.g., movies or survey responses. Of course, the observations are typically incomplete, in the sense that a significant proportion of the entries are missing. A crucial question is then whether it is possible to predict the missing entries from these partial observations.\n\nSince the problem of matrix completion is ill-posed in general, it is necessary to impose a low-dimensional structure on the matrix, one particularly popular example being a low rank constraint. The classical noisy matrix completion problem (real valued observations and additive noise) can be solved provided that the unknown matrix is low rank, either exactly or approximately; see [7, 15, 17, 20, 5, 16] and the references therein. Most commonly used methods amount to solving a least squares program under a rank constraint, or under the convex relaxation of a rank constraint provided by the nuclear (or trace) norm [10].\n\nThe problem of probabilistic low rank matrix completion over a finite alphabet has received much less attention; see [22, 8, 6] among others.
To the best of our knowledge, only the binary case (also referred to as the 1-bit matrix completion problem) has been covered in depth. In [8], the authors proposed to model the entries as Bernoulli random variables whose success rates depend on the matrix to be recovered through a convex link function (logistic and probit functions being natural examples). The estimated matrix is then obtained as a solution of a maximization of the log-likelihood of the observations under an explicit low-rank constraint. Moreover, the sampling model proposed in [8] assumes that the entries are sampled uniformly at random. Unfortunately, this condition is not realistic in recommender system applications: in such a context some users are more active than others and some popular items are rated more frequently. Theoretically, an important issue is that the method of [8] requires the knowledge of an upper bound on the nuclear norm or on the rank of the unknown matrix.\n\nVariations on 1-bit matrix completion were further considered in [6], where a max-norm (though the name is similar, this is different from the sup-norm) constrained minimization is studied. The method of [6] allows more general non-uniform sampling schemes but still requires an upper bound on the max-norm of the unknown matrix.\n\nIn the present paper we consider a penalized maximum log-likelihood method, in which the log-likelihood of the observations is penalized by the nuclear norm (i.e., we focus on the Lagrangian version rather than on the constrained one). We first establish an upper bound on the Kullback-Leibler divergence between the true and the estimated distributions under general sampling distributions; see Section 2 for details.
One should note that our method only requires the knowledge of an upper bound on the maximum absolute value of the probabilities, and improves upon previous results found in the literature.\n\nLast but not least, we propose an efficient implementation of our statistical procedure, adapted from the lifted coordinate descent algorithm recently introduced in [9, 14]. Unlike other methods, this iterative algorithm is designed to solve the convex optimization problem itself and not a (possibly non-convex) approximate formulation, as in [21]. It also has the benefit that it does not need to perform a full or partial SVD (Singular Value Decomposition) at every iteration; see Section 3 for details.\n\nNotation\n\nDefine $m_1 \\wedge m_2 := \\min(m_1, m_2)$ and $m_1 \\vee m_2 := \\max(m_1, m_2)$. We equip the set of $m_1 \\times m_2$ matrices with real entries (denoted $\\mathbb{R}^{m_1 \\times m_2}$) with the scalar product $\\langle X | X' \\rangle := \\operatorname{tr}(X^\\top X')$. For a given matrix $X \\in \\mathbb{R}^{m_1 \\times m_2}$ we write $\\|X\\|_\\infty := \\max_{i,j} |X_{i,j}|$ and, for $q \\ge 1$, we denote its Schatten $q$-norm by\n$$\\|X\\|_{\\sigma,q} := \\Bigl( \\sum_{i=1}^{m_1 \\wedge m_2} \\sigma_i(X)^q \\Bigr)^{1/q},$$\nwhere $\\sigma_i(X)$ are the singular values of $X$ ordered decreasingly (see [1] for more details on such norms). The operator norm of $X$ is given by $\\|X\\|_{\\sigma,\\infty} := \\sigma_1(X)$. Consider two vectors of $p-1$ matrices $(X^j)_{j=1}^{p-1}$ and $(X'^j)_{j=1}^{p-1}$ such that for any $(k, l) \\in [m_1] \\times [m_2]$ we have $X^j_{k,l} \\ge 0$, $X'^j_{k,l} \\ge 0$, $1 - \\sum_{j=1}^{p-1} X^j_{k,l} \\ge 0$ and $1 - \\sum_{j=1}^{p-1} X'^j_{k,l} \\ge 0$.
Their squared Hellinger distance is\n$$d_H^2(X, X') := \\frac{1}{m_1 m_2} \\sum_{\\substack{k \\in [m_1] \\\\ l \\in [m_2]}} \\Bigl[ \\sum_{j=1}^{p-1} \\Bigl( \\sqrt{X^j_{k,l}} - \\sqrt{X'^j_{k,l}} \\Bigr)^2 + \\Bigl( \\sqrt{1 - \\textstyle\\sum_{j=1}^{p-1} X^j_{k,l}} - \\sqrt{1 - \\textstyle\\sum_{j=1}^{p-1} X'^j_{k,l}} \\Bigr)^2 \\Bigr]$$\nand their Kullback-Leibler divergence is\n$$\\mathrm{KL}(X, X') := \\frac{1}{m_1 m_2} \\sum_{\\substack{k \\in [m_1] \\\\ l \\in [m_2]}} \\Bigl[ \\sum_{j=1}^{p-1} X^j_{k,l} \\log \\frac{X^j_{k,l}}{X'^j_{k,l}} + \\Bigl( 1 - \\sum_{j=1}^{p-1} X^j_{k,l} \\Bigr) \\log \\frac{1 - \\sum_{j=1}^{p-1} X^j_{k,l}}{1 - \\sum_{j=1}^{p-1} X'^j_{k,l}} \\Bigr].$$\nGiven an integer $p > 1$, a function $f : \\mathbb{R}^{p-1} \\to \\mathbb{R}^{p-1}$ is called a $p$-link function if for any $x \\in \\mathbb{R}^{p-1}$ it satisfies $f^j(x) \\ge 0$ for $j \\in [p-1]$ and $1 - \\sum_{j=1}^{p-1} f^j(x) \\ge 0$. For any collection of $p-1$ matrices $(X^j)_{j=1}^{p-1}$, $f(X)$ denotes the vector of matrices $(f(X)^j)_{j=1}^{p-1}$ such that $f(X)^j_{k,l} = f^j(X^1_{k,l}, \\dots, X^{p-1}_{k,l})$ for any $(k, l) \\in [m_1] \\times [m_2]$ and $j \\in [p-1]$.\n\n2 Main results\n\nLet $p$ denote the cardinality of our finite alphabet, that is, the number of classes of the logistic model (e.g., ratings have $p$ possible values or surveys $p$ possible answers). For a vector of $p-1$ matrices $X = (X^j)_{j=1}^{p-1}$ of $\\mathbb{R}^{m_1 \\times m_2}$ and an index $\\omega \\in [m_1] \\times [m_2]$, we denote by $X_\\omega$ the vector $(X^j_\\omega)_{j=1}^{p-1}$. We consider an i.i.d.
sequence $(\\omega_i)_{1 \\le i \\le n}$ over $[m_1] \\times [m_2]$, with a probability distribution $\\Pi$ that controls the way the matrix entries are revealed. It is customary to consider the simple uniform sampling distribution over the set $[m_1] \\times [m_2]$, though more general sampling schemes can be considered as well. We observe $n$ independent random elements $(Y_i)_{1 \\le i \\le n} \\in [p]^n$. The observations $(Y_1, \\dots, Y_n)$ are assumed to be independent and to follow a multinomial distribution with success probabilities given by\n$$\\mathbb{P}(Y_i = j) = f^j(\\bar{X}^1_{\\omega_i}, \\dots, \\bar{X}^{p-1}_{\\omega_i}) \\quad \\text{for } j \\in [p-1], \\qquad \\mathbb{P}(Y_i = p) = 1 - \\sum_{j=1}^{p-1} \\mathbb{P}(Y_i = j),$$\nwhere $\\{f^j\\}_{j=1}^{p-1}$ is a $p$-link function and $\\bar{X} = (\\bar{X}^j)_{j=1}^{p-1}$ is the vector of true (unknown) parameters we aim at recovering. For ease of notation, we often write $\\bar{X}_i$ instead of $\\bar{X}_{\\omega_i}$. Let us denote by $\\Phi_Y$ the (normalized) negative log-likelihood of the observations:\n$$\\Phi_Y(X) = -\\frac{1}{n} \\sum_{i=1}^n \\Bigl[ \\sum_{j=1}^{p-1} \\mathbf{1}_{\\{Y_i = j\\}} \\log\\bigl(f^j(X_i)\\bigr) + \\mathbf{1}_{\\{Y_i = p\\}} \\log\\Bigl(1 - \\sum_{j=1}^{p-1} f^j(X_i)\\Bigr) \\Bigr]. \\qquad (1)$$\nFor any $\\gamma > 0$ our proposed estimator is the following:\n$$\\hat{X} = \\mathop{\\arg\\min}_{\\substack{X \\in (\\mathbb{R}^{m_1 \\times m_2})^{p-1} \\\\ \\max_{j \\in [p-1]} \\|X^j\\|_\\infty \\le \\gamma}} \\Phi^\\lambda_Y(X), \\quad \\text{where } \\Phi^\\lambda_Y(X) = \\Phi_Y(X) + \\lambda \\sum_{j=1}^{p-1} \\|X^j\\|_{\\sigma,1}, \\qquad (2)$$\nwith $\\lambda > 0$ a regularization parameter controlling the rank of the estimator. In the rest of the paper we assume that the negative log-likelihood $\\Phi_Y$ is convex (this is the case for the multinomial logit function; see for instance [3]).\n\nIn this section we present two results controlling the estimation error of $\\hat{X}$ in the binomial setting (i.e., when $p = 2$).
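In the binary case ($p = 2$) with a logistic link, the penalized objective of Eq. (2) can be written down directly. The following is a minimal NumPy sketch, for illustration only (the index arrays, labels in {0, 1} and variable names are our own conventions, not the paper's C implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def penalized_nll(X, rows, cols, y, lam):
    """Binary-case (p = 2) penalized objective of Eq. (2):
    normalized negative log-likelihood Phi_Y with a logistic link,
    plus lam times the nuclear norm of X.

    rows, cols index the revealed entries omega_i; y[i] in {0, 1}
    encodes the two classes (names are ours, for illustration)."""
    probs = sigmoid(X[rows, cols])            # P(Y_i = 1) = f(X_{omega_i})
    nll = -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))
    nuclear = np.linalg.norm(X, ord="nuc")    # ||X||_{sigma,1}: sum of singular values
    return nll + lam * nuclear
```

At $X = 0$ every observation has likelihood one half, so the unpenalized objective equals $\\log 2$; the nuclear-norm term then adds $\\lambda \\|X\\|_{\\sigma,1}$ on top.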
Before doing so, let us introduce some additional notation and assumptions. The score function (defined as the gradient of the negative log-likelihood) taken at the true parameter $\\bar{X}$ is denoted by $\\bar{\\Sigma} := \\nabla \\Phi_Y(\\bar{X})$. We also need the following constants, depending on the link function $f$ and on $\\gamma > 0$:\n$$M_\\gamma = \\sup_{|x| \\le \\gamma} 2 |\\log(f(x))|, \\quad L_\\gamma = \\max\\Bigl( \\sup_{|x| \\le \\gamma} \\frac{|f'(x)|}{f(x)}, \\; \\sup_{|x| \\le \\gamma} \\frac{|f'(x)|}{1 - f(x)} \\Bigr), \\quad K_\\gamma = \\inf_{|x| \\le \\gamma} \\frac{f'(x)^2}{8 f(x)(1 - f(x))}.$$\nIn our framework, we allow for a general distribution for observing the coefficients. However, we need to control deviations of the sampling mechanism from the uniform distribution, and we therefore consider the following assumptions.\n\nH1. There exists a constant $\\mu \\ge 1$ such that for all indexes $(k, l) \\in [m_1] \\times [m_2]$,\n$$\\min_{k,l} \\pi_{k,l} \\ge 1/(\\mu m_1 m_2), \\quad \\text{with } \\pi_{k,l} := \\Pi(\\omega_1 = (k, l)).$$\nLet us define $C_l := \\sum_{k=1}^{m_1} \\pi_{k,l}$ (resp. $R_k := \\sum_{l=1}^{m_2} \\pi_{k,l}$) for any $l \\in [m_2]$ (resp. $k \\in [m_1]$), the probability of sampling a coefficient in column $l$ (resp. in row $k$).\n\nH2. There exists a constant $\\nu \\ge 1$ such that\n$$\\max_{k,l}(R_k, C_l) \\le \\nu/(m_1 \\wedge m_2).$$\nAssumption H1 ensures that each coefficient has a non-zero probability of being sampled, whereas H2 requires that no column or row is sampled with too high a probability (see also [11, 16] for more details on this condition).\n\nWe define the sequence of matrices $(E_i)_{i=1}^n$ associated to the revealed coefficients $(\\omega_i)_{i=1}^n$ by $E_i := e_{k_i}(e'_{l_i})^\\top$, where $(k_i, l_i) = \\omega_i$ and $(e_k)_{k=1}^{m_1}$ (resp. $(e'_l)_{l=1}^{m_2}$) is the canonical basis of $\\mathbb{R}^{m_1}$ (resp. $\\mathbb{R}^{m_2}$).
Furthermore, if $(\\varepsilon_i)_{1 \\le i \\le n}$ is a Rademacher sequence independent from $(\\omega_i)_{i=1}^n$ and $(Y_i)_{1 \\le i \\le n}$, we define\n$$\\Sigma_R := \\frac{1}{n} \\sum_{i=1}^n \\varepsilon_i E_i.$$\nWe can now state our first result. For completeness, the proofs can be found in the supplementary material.\n\nTheorem 1. Assume H1 holds, $\\lambda \\ge 2 \\|\\bar{\\Sigma}\\|_{\\sigma,\\infty}$ and $\\|\\bar{X}\\|_\\infty \\le \\gamma$. Then, with probability at least $1 - 2/d$, the Kullback-Leibler divergence between the true and estimated distributions is bounded by\n$$\\mathrm{KL}\\bigl(f(\\bar{X}), f(\\hat{X})\\bigr) \\le 8 \\max\\Bigl( \\frac{\\mu^2}{K_\\gamma} m_1 m_2 \\operatorname{rank}(\\bar{X}) \\bigl( \\lambda^2 + c_* L_\\gamma^2 (\\mathbb{E}\\|\\Sigma_R\\|_{\\sigma,\\infty})^2 \\bigr), \\; \\mu e M_\\gamma \\sqrt{\\frac{\\log(d)}{n}} \\Bigr),$$\nwhere $c_*$ is a universal constant.\n\nNote that $\\|\\bar{\\Sigma}\\|_{\\sigma,\\infty}$ is stochastic and that its expectation $\\mathbb{E}\\|\\Sigma_R\\|_{\\sigma,\\infty}$ is unknown. However, thanks to Assumption H2 these quantities can be controlled.\n\nTo ease notation, let us also define $m := m_1 \\wedge m_2$, $M := m_1 \\vee m_2$ and $d := m_1 + m_2$.\n\nTheorem 2. Assume H1 and H2 hold and that $\\|\\bar{X}\\|_\\infty \\le \\gamma$. Assume in addition that $n \\ge 2 m \\log(d)/(9\\nu)$. Taking $\\lambda = 6 L_\\gamma \\sqrt{2\\nu \\log(d)/(mn)}$, then with probability at least $1 - 3/d$ the following holds:\n$$K_\\gamma \\frac{\\|\\bar{X} - \\hat{X}\\|_{\\sigma,2}^2}{m_1 m_2} \\le \\mathrm{KL}\\bigl(f(\\bar{X}), f(\\hat{X})\\bigr) \\le \\max\\Bigl( \\bar{c} \\, \\frac{\\nu \\mu^2 L_\\gamma^2}{K_\\gamma} \\, \\frac{M \\operatorname{rank}(\\bar{X}) \\log(d)}{n}, \\; 8 \\mu e M_\\gamma \\sqrt{\\frac{\\log(d)}{n}} \\Bigr),$$\nwhere $\\bar{c}$ is a universal constant.\n\nRemark.
Let us compare the rate of convergence of Theorem 2 with those obtained in previous works on 1-bit matrix completion. In [8], the parameter $\\bar{X}$ is estimated by minimizing the negative log-likelihood under the constraints $\\|X\\|_\\infty \\le \\gamma$ and $\\|X\\|_{\\sigma,1} \\le \\gamma \\sqrt{r m_1 m_2}$ for some $r > 0$. Under the assumption that $\\operatorname{rank}(\\bar{X}) \\le r$, they prove that\n$$\\frac{\\|\\bar{X} - \\hat{X}\\|_{\\sigma,2}^2}{m_1 m_2} \\le C_\\gamma \\sqrt{\\frac{r d}{n}},$$\nwhere $C_\\gamma$ is a constant depending on $\\gamma$ (see [8, Theorem 1]). This rate of convergence is slower than the one given by Theorem 2. [6] studied a max-norm constrained maximum likelihood estimator and obtained a rate of convergence similar to that of [8].\n\n3 Numerical Experiments\n\nImplementation\nFor the numerical experiments, data were simulated according to a multinomial logit distribution. In this setting, an observation $Y_{k,l}$ associated to row $k$ and column $l$ is distributed as $\\mathbb{P}(Y_{k,l} = j) = f^j(X^1_{k,l}, \\dots, X^{p-1}_{k,l})$, where\n$$f^j(x_1, \\dots, x_{p-1}) = \\exp(x_j) \\Bigl( 1 + \\sum_{j'=1}^{p-1} \\exp(x_{j'}) \\Bigr)^{-1} \\quad \\text{for } j \\in [p-1]. \\qquad (3)$$\nWith this choice, $\\Phi_Y$ is convex and problem (2) can be solved using convex optimization algorithms. Moreover, following the advice of [8], we considered the unconstrained version of problem (2) (i.e., with no constraint on $\\|X\\|_\\infty$), which significantly reduces the computational burden and has no significant impact on the solution in practice. To solve this problem, we have extended to the multinomial case the coordinate gradient descent algorithm introduced by [9].
This type of algorithm has the advantage, say over the Soft-Impute [19] or the SVT [4] algorithms, that it does not require the computation of a full SVD at each step of the main loop of an iterative (proximal) algorithm (bear in mind that the proximal operator associated to the nuclear norm is the soft-thresholding operator of the singular values). The proposed version only computes the largest singular values and the associated singular vectors. This potentially decreases the computation by a factor close to the value of the upper bound on the rank commonly used (see the aforementioned paper for more details).\n\nLet us present the algorithm. Any vector of $p-1$ matrices $X = (X^j)_{j=1}^{p-1}$ is identified with an element of the tensor product space $\\mathbb{R}^{m_1 \\times m_2} \\otimes \\mathbb{R}^{p-1}$ and denoted by\n$$X = \\sum_{j=1}^{p-1} X^j \\otimes e_j, \\qquad (4)$$\nwhere again $(e_j)_{j=1}^{p-1}$ is the canonical basis of $\\mathbb{R}^{p-1}$ and $\\otimes$ stands for the tensor product. The set of normalized rank-one matrices is denoted by\n$$\\mathcal{M} := \\bigl\\{ M \\in \\mathbb{R}^{m_1 \\times m_2} \\mid M = u v^\\top, \\; \\|u\\| = \\|v\\| = 1, \\; u \\in \\mathbb{R}^{m_1}, v \\in \\mathbb{R}^{m_2} \\bigr\\}.$$\nDefine $\\Theta$ as the linear space of real-valued functions on $\\mathcal{M}$ with finite support, i.e., $\\theta(M) = 0$ except for a finite number of $M \\in \\mathcal{M}$. This space is equipped with the $\\ell_1$-norm $\\|\\theta\\|_1 = \\sum_{M \\in \\mathcal{M}} |\\theta(M)|$. Define by $\\Theta_+$ the positive orthant, i.e., the cone of functions $\\theta \\in \\Theta$ such that $\\theta(M) \\ge 0$ for all $M \\in \\mathcal{M}$. Any tensor $X$ can be associated with a vector $\\theta = (\\theta^1, \\dots, \\theta^{p-1}) \\in \\Theta_+^{p-1}$, i.e.,\n$$X = \\sum_{j=1}^{p-1} \\sum_{M \\in \\mathcal{M}} \\theta^j(M) \\, M \\otimes e_j. \\qquad (5)$$\nSuch representations are not unique; among them, the one associated to the SVD plays a key role, as we will see below. For a given $X$ represented by (4) and for any $j \\in \\{1, \\dots, p-1\\}$, denote by $\\{\\sigma^j_k\\}_{k=1}^{n_j}$ the (non-zero) singular values of the matrix $X^j$ and by $\\{u^j_k, v^j_k\\}_{k=1}^{n_j}$ the associated singular vectors.
Then, $X$ may be expressed as\n$$X = \\sum_{j=1}^{p-1} \\sum_{k=1}^{n_j} \\sigma^j_k \\, u^j_k (v^j_k)^\\top \\otimes e_j. \\qquad (6)$$\nDefining $\\theta^j$ as the function such that $\\theta^j(M) = \\sigma^j_k$ if $M = u^j_k (v^j_k)^\\top$ for some $k \\in [n_j]$, and $\\theta^j(M) = 0$ otherwise, one obtains a representation of the type given in Eq. (5).\n\nConversely, for any $\\theta = (\\theta^1, \\dots, \\theta^{p-1}) \\in \\Theta^{p-1}$, define the map\n$$W : \\theta \\mapsto W_\\theta := \\sum_{j=1}^{p-1} W^j_\\theta \\otimes e_j, \\quad \\text{with } W^j_\\theta := \\sum_{M \\in \\mathcal{M}} \\theta^j(M) \\, M,$$\nand the auxiliary objective function\n$$\\tilde{\\Phi}^\\lambda_Y(\\theta) = \\lambda \\sum_{j=1}^{p-1} \\sum_{M \\in \\mathcal{M}} \\theta^j(M) + \\Phi_Y(W_\\theta). \\qquad (7)$$\nThe map $\\theta \\mapsto W_\\theta$ is a continuous linear map from $(\\Theta^{p-1}, \\|\\cdot\\|_1)$ to $\\mathbb{R}^{m_1 \\times m_2} \\otimes \\mathbb{R}^{p-1}$, where $\\|\\theta\\|_1 = \\sum_{j=1}^{p-1} \\sum_{M \\in \\mathcal{M}} |\\theta^j(M)|$. In addition, for all $\\theta \\in \\Theta_+^{p-1}$,\n$$\\sum_{j=1}^{p-1} \\|W^j_\\theta\\|_{\\sigma,1} \\le \\|\\theta\\|_1,$$\nand one obtains $\\|\\theta\\|_1 = \\sum_{j=1}^{p-1} \\|W^j_\\theta\\|_{\\sigma,1}$ when $\\theta$ is the representation associated to the SVD decomposition. An important consequence, outlined in [9, Proposition 3.1], is that the minimization of (7) is actually equivalent to the minimization of (2); see [9, Theorem 3.2].\n\nThe proposed coordinate gradient descent algorithm updates at each step the nonnegative finite support function $\\theta$. For $\\theta \\in \\Theta$ we denote by $\\operatorname{supp}(\\theta)$ the support of $\\theta$ and, for $M \\in \\mathcal{M}$, by $\\delta_M \\in \\Theta$ the Dirac function on $M$, satisfying $\\delta_M(M) = 1$ and $\\delta_M(M') = 0$ if $M' \\ne M$.
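The SVD-based representation of Eq. (6) and the equality it achieves in the norm inequality above can be checked numerically. Below is a minimal NumPy sketch for a single class (p - 1 = 1, for brevity), with variable names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-2 matrix X, standing for one class matrix X^j (p - 1 = 1 here).
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 7))

# SVD-based representation (Eq. (6)): theta puts weight sigma_k on the
# normalized rank-one atom u_k v_k^T and is zero on every other atom of M.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
keep = np.flatnonzero(s > 1e-12)              # drop numerically-zero singular values
atoms = [np.outer(U[:, k], Vt[k, :]) for k in keep]
weights = [s[k] for k in keep]

# The map theta -> W_theta of Eq. (5) recombines the weighted atoms.
W = sum(w * M for w, M in zip(weights, atoms))

assert np.allclose(W, X)                      # W_theta recovers X
# For this particular representation, ||theta||_1 equals the nuclear norm.
assert np.isclose(sum(weights), np.linalg.norm(X, ord="nuc"))
```

For a representation that is not SVD-based (e.g., splitting one atom into two correlated rank-one terms), the inequality is strict, which is why the SVD representation attains the nuclear-norm penalty of (2) exactly.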
In our experiments, the initial $\\theta_0$ was set to zero.\n\nAlgorithm 1: Multinomial lifted coordinate gradient descent\nData: observations $Y$, tuning parameter $\\lambda$, initial parameter $\\theta_0 \\in \\Theta_+^{p-1}$, tolerance $\\epsilon$, maximum number of iterations $K$\nResult: $\\theta \\in \\Theta_+^{p-1}$\nInitialization: $\\theta \\leftarrow \\theta_0$, $k \\leftarrow 0$\nwhile $k \\le K$ do\n  for $j = 1$ to $p-1$ do\n    compute the top singular vector pair of $(-\\nabla \\Phi_Y(W_\\theta))^j$: $u^j, v^j$\n  let $g = \\lambda + \\min_{j=1,\\dots,p-1} \\langle \\nabla \\Phi_Y(W_\\theta) \\mid u^j (v^j)^\\top \\rangle$\n  if $g \\le -\\epsilon/2$ then\n    $(\\beta_1, \\dots, \\beta_{p-1}) = \\arg\\min_{(b_1, \\dots, b_{p-1}) \\in \\mathbb{R}_+^{p-1}} \\tilde{\\Phi}^\\lambda_Y\\bigl(\\theta + (b_1 \\delta_{u^1(v^1)^\\top}, \\dots, b_{p-1} \\delta_{u^{p-1}(v^{p-1})^\\top})\\bigr)$\n    $\\theta \\leftarrow \\theta + (\\beta_1 \\delta_{u^1(v^1)^\\top}, \\dots, \\beta_{p-1} \\delta_{u^{p-1}(v^{p-1})^\\top})$\n    $k \\leftarrow k + 1$\n  else\n    let $g_{\\max} = \\max_{j \\in [p-1]} \\max_{u^j(v^j)^\\top \\in \\operatorname{supp}(\\theta^j)} |\\lambda + \\langle \\nabla \\Phi_Y(W_\\theta) \\mid u^j (v^j)^\\top \\rangle|$\n    if $g_{\\max} \\le \\epsilon$ then\n      break\n    else\n      $\\theta \\leftarrow \\arg\\min_{\\theta' \\in \\Theta_+^{p-1}, \\; \\operatorname{supp}(\\theta'^j) \\subset \\operatorname{supp}(\\theta^j), \\; j \\in [p-1]} \\tilde{\\Phi}^\\lambda_Y(\\theta')$\n      $k \\leftarrow k + 1$\n\nA major interest of Algorithm 1 is that it only requires storing the values of the parameter entries for the indexes that are actually observed. Since in practice the number of observations is much smaller than the total number of coefficients $m_1 m_2$, this algorithm is both memory and computationally efficient. Moreover, using an SVD algorithm such as Arnoldi iterations to compute the top singular value and vector pairs (see [12, Section 10.5] for instance) allows us to take full advantage of the sparse structure of the gradient.
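In the binary case, one pass of the main loop of Algorithm 1 can be sketched in a few lines. The NumPy sketch below is a simplified illustration only, not the authors' C implementation: it uses a dense gradient, a plain power iteration in place of Arnoldi iterations for the top singular pair, and a fixed step size instead of the exact minimization over the coefficients beta:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_nll(X, rows, cols, y):
    """Gradient of the normalized negative log-likelihood (binary case).
    It is nonzero only at the observed entries omega_i, hence sparse."""
    G = np.zeros_like(X)
    np.add.at(G, (rows, cols), (sigmoid(X[rows, cols]) - y) / len(y))
    return G

def top_singular_pair(A, n_iter=100):
    """Top singular vector pair of A via power iteration on A^T A
    (a stand-in for the Arnoldi iterations mentioned in the text)."""
    v = np.random.default_rng(0).standard_normal(A.shape[1])
    for _ in range(n_iter):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    u = A @ v
    u /= np.linalg.norm(u)
    return u, v

def lifted_cd_step(X, rows, cols, y, lam, step=0.1):
    """One simplified pass of the main loop: add the best rank-one atom
    of the negative gradient when it yields a descent direction
    (cf. the g <= -eps/2 test); a fixed step replaces the exact
    minimization over the beta coefficients."""
    G = grad_nll(X, rows, cols, y)
    u, v = top_singular_pair(-G)
    g = lam + np.sum(G * np.outer(u, v))   # lam + <grad Phi_Y | u v^T>
    if g < 0:
        X = X + step * np.outer(u, v)
    return X
```

In the actual algorithm, the step sizes are obtained by exactly minimizing the lifted objective over the newly added atoms, and the support of theta is periodically re-optimized, which is what makes the iterate stay a sparse combination of rank-one atoms.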
Algorithm 1 was implemented in C; Table 1 gives a rough idea of the execution time in the two-class case on a 3.07 GHz W3550 Xeon CPU (RAM 1.66 GB, cache 8 MB).\n\nTable 1: Execution time of the proposed algorithm for the binary case.\nParameter size | $10^3 \\times 10^3$ | $3 \\cdot 10^3 \\times 3 \\cdot 10^3$ | $10^4 \\times 10^4$\nObservations | $10^5$ | $10^5$ | $10^7$\nExecution time (s) | 4.5 | 52 | 730\n\nSimulated experiments\nTo evaluate our procedure, we performed simulations for matrices with $p = 2$ or $p = 5$ classes. For each class matrix $X^j$ we sampled five unitary vector pairs $(u^j_k, v^j_k)_{k=1}^5$ uniformly at random. We then generated matrices of rank equal to 5, such that\n$$X^j = \\Gamma \\sqrt{m_1 m_2} \\sum_{k=1}^5 \\alpha_k u^j_k (v^j_k)^\\top,$$\nwith $(\\alpha_1, \\dots, \\alpha_5) = (2, 1, 0.5, 0.25, 0.1)$ and $\\Gamma$ a scaling factor. The $\\sqrt{m_1 m_2}$ factor guarantees that $\\mathbb{E}[\\|X^j\\|_\\infty]$ does not depend on the problem sizes $m_1$ and $m_2$.\n\nWe then sampled the entries uniformly and the observations according to the logit distribution given by Eq. (3). We then considered and compared the following two estimators, both computed using Algorithm 1:\n\n• the logit version of our method (with the link function given by Eq. (3));\n• the Gaussian completion method (denoted by $\\hat{X}^N$), which consists in using the Gaussian log-likelihood instead of the multinomial one in (2), i.e., using a classical squared Frobenius norm (the implementation being adapted mutatis mutandis). Moreover, an estimate of the standard deviation is obtained by the classical analysis of the residuals.\n\nContrary to the logit version, the Gaussian matrix completion does not directly recover the probabilities of observing a rating.
However, we can estimate this probability by the quantity\n$$\\mathbb{P}(\\hat{X}^N_{k,l} = j) = F_{\\mathcal{N}(0,1)}(p_{j+1}) - F_{\\mathcal{N}(0,1)}(p_j), \\quad \\text{with } F_{\\mathcal{N}(0,1)}(p_j) := \\begin{cases} 0 & \\text{if } j = 1, \\\\ F_{\\mathcal{N}(0,1)}\\bigl((j - 0.5 - \\hat{X}^N_{k,l})/\\hat{\\sigma}\\bigr) & \\text{if } 1 < j \\le p, \\\\ 1 & \\text{if } j = p + 1, \\end{cases}$$\nwhere $F_{\\mathcal{N}(0,1)}$ is the cdf of a standard Gaussian random variable.\n\nAs we see in Figure 1, the logistic estimator outperforms the Gaussian one for both $p = 2$ and $p = 5$ in terms of Kullback-Leibler divergence. This was expected, because the Gaussian model only allows symmetric distributions with the same variance for all ratings, which is not the case for logistic distributions. The parameter $\\lambda$ has been chosen for both methods by performing 5-fold cross-validation on a geometric grid of size $0.8 \\log(n)$.\n\nTables 2 and 3 summarize the results obtained for a $900 \\times 1350$ matrix, for $p = 2$ and $p = 5$ respectively. For both the binomial case $p = 2$ and the multinomial case $p = 5$, the logistic model slightly outperforms the Gaussian model. This is partly due to the fact that in the multinomial case, some ratings can have a multi-modal distribution. In such a case, the Gaussian model is unable to predict these ratings, because its distribution is necessarily centered around a single value and is not flexible enough. For instance, consider the case of a rating distribution with a high probability of seeing 1 or 5 and a low probability of getting 2, 3 or 4, where we observed both 1's and 5's.
The estimator based on a Gaussian model will tend to center its distribution around 2.5 and will therefore miss the bimodal shape of the distribution.\n\nTable 2: Prediction errors for a binomial (2 classes) underlying model, for a $900 \\times 1350$ matrix.\nObservations | $10 \\cdot 10^3$ | $50 \\cdot 10^3$ | $100 \\cdot 10^3$ | $500 \\cdot 10^3$\nGaussian prediction error | 0.49 | 0.34 | 0.29 | 0.26\nLogistic prediction error | 0.42 | 0.30 | 0.27 | 0.24\n\nTable 3: Prediction errors for a multinomial (5 classes) underlying model, for a $900 \\times 1350$ matrix.\nObservations | $10 \\cdot 10^3$ | $50 \\cdot 10^3$ | $100 \\cdot 10^3$ | $500 \\cdot 10^3$\nGaussian prediction error | 0.78 | 0.76 | 0.73 | 0.69\nLogistic prediction error | 0.75 | 0.54 | 0.47 | 0.43\n\nFigure 1: Kullback-Leibler divergence between the estimated and the true model for different matrix sizes and sampling fractions, normalized by the number of classes. Right figure: binomial and Gaussian models; left figure: multinomial with five classes and Gaussian model. Results are averaged over five samples.\n\nReal dataset\nWe also ran the same estimators on the MovieLens 100k dataset. In the case of real data we cannot compute the Kullback-Leibler divergence, since no ground truth is available. Therefore, to compare the prediction errors, we randomly selected 20% of the entries as a test set, and the remaining entries were split between a training set (80%) and a validation set (20%).\n\nFor this dataset, ratings range from 1 to 5. To assess the benefit of a binomial model, we tested each rating against all the others (e.g., ratings of 5 are set to 0 and all others are set to 1). Interestingly, we see that the Gaussian prediction error is significantly better when choosing labels $-1, 1$ instead of labels $0, 1$.
This is another motivation for not using the Gaussian version: the sensitivity to the choice of alphabet seems to be crucial for the Gaussian version, whereas the binomial/multinomial ones are insensitive to it. These results are summarized in Table 4.\n\nTable 4: Binomial prediction error when performing the one-versus-the-others procedure on the MovieLens 100k dataset.\nRating | 1 | 2 | 3 | 4 | 5\nGaussian prediction error (labels $-1$ and $1$) | 0.06 | 0.12 | 0.28 | 0.35 | 0.19\nGaussian prediction error (labels $0$ and $1$) | 0.12 | 0.20 | 0.39 | 0.46 | 0.30\nLogistic prediction error | 0.06 | 0.11 | 0.27 | 0.34 | 0.20\n\n4 Conclusion and future work\n\nWe have proposed a new nuclear norm penalized maximum log-likelihood estimator and have provided strong theoretical guarantees on its estimation accuracy in the binary case. Compared to previous works on 1-bit matrix completion, our method has some important advantages. First, it works under quite mild assumptions on the sampling distribution. Second, it requires only an upper bound on the maximal absolute value of the entries of the unknown matrix. Finally, the rates of convergence given by Theorem 2 are faster than those obtained in [8] and [6]. In future work, we could consider the extension to more general data fitting terms, possibly generalize the results to tensor formulations, or penalize directly the nuclear norm of the matrix of probabilities themselves.\n\nAcknowledgments\n\nJean Lafond is grateful for funding from the Direction Générale de l'Armement (DGA) and to the labex LMH through grant no. ANR-11-LABX-0056-LMH in the framework of the “Programme des Investissements d'Avenir”. Joseph Salmon acknowledges the Chair Machine Learning for Big Data for partial financial support.
The authors would also like to thank Alexandre Gramfort for helpful discussions.\n\nReferences\n\n[1] R. Bhatia. Matrix analysis, volume 169 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1997.\n[2] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.\n[3] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004.\n[4] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.\n[5] T. T. Cai and W.-X. Zhou. Matrix completion via max-norm constrained optimization. CoRR, abs/1303.0341, 2013.\n[6] T. T. Cai and W.-X. Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res., 14:3619–3647, 2013.\n[7] E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.\n[8] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters. 1-bit matrix completion. CoRR, abs/1209.3672, 2012.\n[9] M. Dudík, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-norm regularization. In AISTATS, 2012.\n[10] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.\n[11] R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro. Learning with the weighted trace-norm under arbitrary sampling distributions. In NIPS, pages 2133–2141, 2011.\n[12] G. H.
Golub and C. F. van Loan. Matrix computations. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013.\n[13] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.\n[14] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, pages 1–38, 2014.\n[15] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. J. Mach. Learn. Res., 11:2057–2078, 2010.\n[16] O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 2(1):282–303, 2014.\n[17] V. Koltchinskii, A. B. Tsybakov, and K. Lounici. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist., 39(5):2302–2329, 2011.\n[18] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.\n[19] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11:2287–2322, 2010.\n[20] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res., 13:1665–1697, 2012.\n[21] B. Recht and C. Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.\n[22] A. Todeschini, F. Caron, and M. Chavent. Probabilistic low-rank matrix completion with adaptive spectral regularization algorithms.
In NIPS, pages 845–853, 2013.", "award": [], "sourceid": 905, "authors": [{"given_name": "Jean", "family_name": "Lafond", "institution": "Télécom ParisTech"}, {"given_name": "Olga", "family_name": "Klopp", "institution": "Université Paris Ouest"}, {"given_name": "Eric", "family_name": "Moulines", "institution": "Telecom ParisTech"}, {"given_name": "Joseph", "family_name": "Salmon", "institution": "Télécom ParisTech"}]}