{"title": "Preference Completion from Partial Rankings", "book": "Advances in Neural Information Processing Systems", "page_first": 1370, "page_last": 1378, "abstract": "We propose a novel and efficient algorithm for the collaborative preference completion problem, which involves jointly estimating individualized rankings for a set of entities over a shared set of items, based on a limited number of observed affinity values. Our approach exploits the observation that while preferences are often recorded as numerical scores, the predictive quantity of interest is the underlying rankings. Thus, attempts to closely match the recorded scores may lead to overfitting and impair generalization performance. Instead, we propose an estimator that directly fits the underlying preference order, combined with nuclear norm constraints to encourage low--rank parameters. Besides (approximate) correctness of the ranking order, the proposed estimator makes no generative assumption on the numerical scores of the observations. One consequence is that the proposed estimator can fit any consistent partial ranking over a subset of the items represented as a directed acyclic graph (DAG), generalizing standard techniques that can only fit preference scores. Despite this generality, for supervision representing total or blockwise total orders, the computational complexity of our algorithm is within a $\\log$ factor of the standard algorithms for nuclear norm regularization based estimates for matrix completion. 
We further show promising empirical results for a novel and challenging application of collaborative ranking of the associations between brain--regions and cognitive neuroscience terms.", "full_text": "Preference Completion from Partial Rankings\n\nSuriya Gunasekar\n\nOluwasanmi Koyejo\n\nUniversity of Texas, Austin, TX, USA\n\nUniversity of Illinois, Urbana-Champaign, IL, USA\n\nsuriya@utexas.edu\n\nsanmi@illinois.edu\n\nJoydeep Ghosh\n\nUniversity of Texas, Austin, TX, USA\n\nghosh@ece.utexas.edu\n\nAbstract\n\nWe propose a novel and efficient algorithm for the collaborative preference completion problem, which involves jointly estimating individualized rankings for a set of entities over a shared set of items, based on a limited number of observed affinity values. Our approach exploits the observation that while preferences are often recorded as numerical scores, the predictive quantity of interest is the underlying rankings. Thus, attempts to closely match the recorded scores may lead to overfitting and impair generalization performance. Instead, we propose an estimator that directly fits the underlying preference order, combined with nuclear norm constraints to encourage low–rank parameters. Besides (approximate) correctness of the ranking order, the proposed estimator makes no generative assumption on the numerical scores of the observations. One consequence is that the proposed estimator can fit any consistent partial ranking over a subset of the items represented as a directed acyclic graph (DAG), generalizing standard techniques that can only fit preference scores. Despite this generality, for supervision representing total or blockwise total orders, the computational complexity of our algorithm is within a log factor of the standard algorithms for nuclear norm regularization based estimates for matrix completion. 
We further show promising empirical results for a novel and challenging application of collaborative ranking of the associations between brain–regions and cognitive neuroscience terms.\n\n1 Introduction\n\nCollaborative preference completion is the task of jointly learning bipartite (or dyadic) preferences of a set of entities for a shared list of items, e.g., user–item interactions in a recommender system [14; 22]. It is commonly assumed that such entity–item preferences are generated from a small number of latent or hidden factors, or equivalently, the underlying preference value matrix is assumed to be low rank. Further, if the observed affinity scores from various explicit and implicit feedback are treated as exact (or mildly perturbed) entries of the unobserved preference value matrix, then the preference completion task naturally fits in the framework of low rank matrix completion [22; 38]. More generally, low rank matrix completion involves predicting the missing entries of a low rank matrix from a vanishing fraction of its entries observed through a noisy channel. Several low rank matrix completion estimators and algorithms have been developed in the literature, many with strong theoretical guarantees and empirical performance [6; 32; 21; 28; 38; 10].\n\nRecent research in the preference completion literature has noted that using a matrix completion estimator for collaborative preference estimation may be misguided [11; 33; 23], as the observed entity–item affinity scores from implicit/explicit feedback are potentially subject to systematic monotonic transformations arising from limitations in feedback collection, e.g., quantization and inherent biases.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n
While simple user biases and linear transformations can be handled within a low rank matrix framework, more complex transformations like quantization can potentially increase the rank of the observed preference score matrix significantly, thus adversely affecting recovery using standard low rank matrix completion [13]. Further, despite the common practice of measuring preferences using numerical scores, predictions are most often deployed or evaluated based on the item ranking, e.g., in recommender systems, user recommendations are often presented as a ranked list of items without the underlying scores. Indeed, several authors have shown that favorable empirical/theoretical performance in mean squared error for the preference matrix often does not translate to better performance when quality is measured using ranking metrics [11; 33; 23]. Thus, collaborative preference estimation may be better posed as a collection of coupled learning to rank (LETOR) problems [25], where we seek to jointly learn the preference rankings of a set of entities by exploiting the low dimensional latent structure of the underlying preference values.\n\nThis paper considers preference completion in a general collaborative LETOR setting. Importantly, while the observations are assumed to be reliable indicators of relative preference ranking, their numerical scores may deviate significantly from the ground truth low rank preference matrix. Therefore, we aim to address preference completion under the following generalizations:\n1. In a simple setting, for each entity, a score vector representing its observed affinity interactions is assumed to be generated from an arbitrary monotonic transformation of the corresponding entries of the ground truth preference matrix. We make no further generative assumptions on observed scores beyond monotonicity with respect to the underlying low rank preference matrix.\n2. 
We also consider a more general setting, where the observed preferences of each entity represent specifications of a partial ranking in the form of a directed acyclic graph (DAG) – the nodes represent a subset of items, and each edge represents a strict ordering between a pair of nodes. Such rankings may be encountered when the preference scores are consolidated from multiple sources of feedback, e.g., comparative feedback (pairwise or listwise) solicited for independent subsets of items. This generalized setting cannot be handled by standard matrix completion without some way of transforming the DAG orderings into a score vector.\n\nOur work is in part motivated by an application to neuroimaging meta-analysis as outlined in the following. Cognitive neuroscience aims to quantify the link between brain function and behavior. This interaction is most often measured in humans using Functional Magnetic Resonance Imaging (fMRI) experiments that measure brain activity in response to behavioral tasks. After analysis, the conclusions are often summarized in neuroscience publications, which include a table of brain locations that are most strongly activated in response to an experimental stimulus. These results can then be synthesized using meta-analysis techniques to derive accurate predictions of brain activity associated with cognitive terms (also known as forward inference) and prediction of cognitive terms associated with brain regions (also known as reverse inference). 
For our study, we used data from neurosynth [36], a public repository (http://neurosynth.org/) which automatically scrapes information on published associations between brain regions and terms in cognitive neuroscience experiments.\n\nThe key contributions of the paper are summarized below.\n• We propose a convex estimator for low rank preference completion using limited supervision, addressing: (a) arbitrary monotonic transformations of preference scores; and (b) partial rankings over items (Section 3.1). We derive generalization error bounds for a surrogate ranking loss that quantifies the trade–off between data–fit and regularization (Section 5).\n• We propose efficient algorithms for the estimate under total and partially ordered observations. In the case of total orders, in spite of increased generality, the computational complexity of our algorithm is within a log factor of the standard convex algorithms for matrix completion (Section 4).\n• The proposed algorithm is evaluated on a novel application of identifying associations between brain–regions and cognitive terms from the neurosynth dataset [37] (Section 6). Such a large scale meta-analysis synthesizing information from the literature and related tasks has the potential to lead to novel insights into the role of brain regions in cognition and behavior.\n\n1.1 Notation\n\nFor a matrix M ∈ R^{d1×d2}, let σ1 ≥ σ2 ≥ . . . be the singular values of M. Then, the nuclear norm is ‖M‖∗ = Σ_i σ_i, the operator norm is ‖M‖op = σ1, and the Frobenius norm is ‖M‖F = √(Σ_i σ_i²). Let [N] = {1, 2, . . . , N}. A vector or a set x indexed by j ∈ [N] is sometimes denoted as (x_j)_{j=1}^N, or simply (x_j) whenever N is unambiguous. Let Ω ⊂ [d1] × [d2] denote a subset of indices of a matrix in R^{d1×d2}. 
For j ∈ [d2], let Ω_j = {(i′, j′) ∈ Ω : j′ = j} ⊂ Ω denote the subset of entries in Ω from the jth column. Given Ω = {(i_s, j_s) : s = 1, 2, . . . , |Ω|}, P_Ω : X → (X_{i_s j_s})_{s=1}^{|Ω|} ∈ R^{|Ω|} is the linear subsampling operator, and P*_Ω : R^{|Ω|} → R^{d1×d2} is its adjoint, i.e., ⟨y, P_Ω(X)⟩ = ⟨X, P*_Ω(y)⟩. For conciseness, we sometimes use the notation X_Ω to denote P_Ω(X).\n\n2 Related Work\n\nMatrix Completion: Low rank matrix completion has an extensive literature; a few examples include [22; 6; 21; 28] among several others. However, the bulk of these works, including those in the context of ranking/recommendation applications, focus on (a) fitting the observed numerical scores using squared loss, and (b) evaluating the results on parameter/rating recovery metrics such as root mean squared error (RMSE). The shortcomings of such estimators and results using squared loss in ranking applications have been studied in some recent research [12; 11]. Motivated by collaborative ranking applications, there has been growing interest in addressing matrix completion within an explicit LETOR framework. Weimer et al. [35] and Koyejo et al. [23] propose estimators that involve non–convex optimization problems, and their algorithmic convergence and generalization behavior are not well understood. Some recent works provide parameter recovery guarantees for pairwise/listwise ranking observations under specific probabilistic distributional assumptions on the observed rankings [31; 26; 29]. In comparison, the estimators and algorithms in this paper are agnostic to the generative distribution, and hence have much wider applicability.\n\nLearning to rank (LETOR): LETOR is a structured prediction task of rank ordering relevance of a list of items as a function of pre–selected features [25]. 
Currently, leading algorithms for LETOR are listwise methods [9] (as is the approach taken in this paper), which fully exploit the ranking structure of ordered observations and offer better modeling flexibility compared to pointwise [24] and pairwise methods [16; 18]. A recent listwise LETOR algorithm proposed the idea of monotone retargeting (MR) [2], which elegantly addresses the listwise LETOR task while maintaining the relative simplicity and scalability of pointwise estimation. MR was further extended to incorporate margins in the margin equipped monotonic retargeting (MEMR) formulation [1] to preclude trivial solutions that arise from scale invariance of the initial MR estimate of Acharyya et al. [2]. The estimator proposed in this paper is inspired by the idea of MR and will be revisited later in the paper. In collaborative preference completion, rather than learning a functional mapping from features to rankings, we seek to exploit the low rank structure in jointly modeling the preferences of a collection of entities without access to preference indicative features.\n\nSingle Index Models (SIMs): Finally, the literature on monotonic single index models (SIMs) also considers estimation under unknown monotonic transformations [17; 20]. However, algorithms for SIMs are designed to solve a harder problem of exactly estimating the non–parametric monotonic transformation, and are evaluated for parameter recovery rather than ranking performance. In general, with no further assumptions, the sample complexity of SIM estimators renders them unsuitable for high dimensional estimation. The existing high dimensional estimators for learning SIMs typically assume Lipschitz continuity of the monotonic transformation, which explicitly uses the observed score values in bounding the Lipschitz constant of the monotonic transformation [19; 13]. 
In comparison, our proposed model is completely agnostic to the numerical values of the preference scores.\n\n3 Preference Completion from Partial Rankings\n\nLet the unobserved true preference scores of d2 entities for d1 items be denoted by a rank r ≪ min{d1, d2} matrix Θ* ∈ R^{d1×d2}. For each entity j ∈ [d2], we observe a partial or total ordering of preferences for a subset of items denoted by I_j ⊂ [d1]. Let n_j = |I_j| denote the number of items over which relative preferences of entity j are observed, so that Ω_j = {(i, j) : i ∈ I_j} denotes the entity-item index set for j, and Ω = ∪_j Ω_j denotes the index set collected across entities. Let P_Ω denote the sampling distribution for Ω. The observed preferences of entity j are typically represented by a listwise preference score vector y^{(j)} ∈ R^{n_j}:\n\n∀j ∈ [d2], y^{(j)} = g_j(P_{Ω_j}(Θ* + W)),    (1)\n\nwhere each g_j is an arbitrary and unknown monotonic transformation, and W ∈ R^{d1×d2} is a non–adversarial noise matrix sampled from the distribution P_W. 
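As an illustration of the observation model (1), the following sketch generates partial observations from a low-rank Θ* under a quantization transform standing in for the unknown monotonic g_j; the sizes, the 10-item subsets, and the choice of quantization as g are illustrative assumptions, not prescribed by the model:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 50, 30, 3                      # items, entities, rank

# Ground-truth low-rank preference matrix Theta* and mild noise W
U, V = rng.standard_normal((d1, r)), rng.standard_normal((d2, r))
Theta = U @ V.T
W = 0.01 * rng.standard_normal((d1, d2))

def g_quantize(v, K=5):
    """An example monotonic g_j: quantize scores into K ordered levels."""
    edges = np.quantile(v, np.linspace(0, 1, K + 1)[1:-1])
    return np.digitize(v, edges) + 1.0     # values in {1, ..., K}

# For each entity j, observe y^(j) = g_j(P_{Omega_j}(Theta* + W))
observations = {}
for j in range(d2):
    I_j = rng.choice(d1, size=10, replace=False)   # observed item subset
    observations[j] = (I_j, g_quantize(Theta[I_j, j] + W[I_j, j]))

# Monotonicity: sorting by the true scores never decreases the observed y
I0, y0 = observations[0]
order = np.argsort(Theta[I0, 0] + W[I0, 0])
assert np.all(np.diff(y0[order]) >= 0)
```

Note that the quantized scores are high-rank even though Θ* is rank 3, which is exactly the regime where fitting the raw scores by squared loss is misleading.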
The preference completion task is to estimate the unseen rankings within each column of Θ* from a subset of orderings (Ω_j, y^{(j)})_{j∈[d2]}. As the (g_j) are arbitrary, the exact values of (y^{(j)}) are inconsequential, and the observed preference order can be specified by a constraint set parameterized by a margin parameter ε as follows:\n\nDefinition 1 (ε–margin Isotonic Set) The following set of vectors are isotonic to y ∈ R^n with an ε > 0 margin parameter:\n\nR^n↓ε(y) = {x ∈ R^n : ∀ i, k ∈ [n], y_i < y_k ⇒ x_i ≤ x_k − ε}.\n\nIn addition to score vectors, isotonic sets of the form R^n↓ε(y) are equivalently defined for any DAG y = G([n], E) which denotes a partial ranking among the vertices, with the convention that (i, k) ∈ E ⇒ ∀x ∈ R^n↓ε(y), x_i ≤ x_k − ε. We note from Definition 1 that ties are not broken arbitrarily, e.g., if y_{i1} = y_{i2} < y_k, then ∀x ∈ R^n↓ε(y), x_{i1} ≤ x_k − ε and x_{i2} ≤ x_k − ε, but no particular ordering between x_{i1} and x_{i2} is specified.\n\nLet y_{(k)} denote the kth smallest entry of y ∈ R^n. We distinguish between three special cases of an observation y representing a partial ranking over [n]:\n(A) Strict Total Order: y_{(1)} < y_{(2)} < . . . < y_{(n)}.\n(B) Blockwise Total Order: y_{(1)} ≤ y_{(2)} ≤ . . . ≤ y_{(n)}, with K ≤ n unique values.\n(C) Arbitrary DAG: Partial order induced by a DAG y = G([n], E).\n\n3.1 Monotone Retargeted Low Rank Estimator\n\nConsider any scalable pointwise learning algorithm that fits a model to exact preference scores. Since no generative model (besides monotonicity) is assumed for the raw numerical scores in the observations, in principle, the scores y^{(j)} for entity j can be replaced or retargeted to any ranking-preserving scores, i.e., by any vector in R^{n_j}↓ε(y^{(j)}). 
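Membership in the ε-margin isotonic set of Definition 1 can be checked directly from the pairwise conditions; a minimal sketch (the function name and toy values below are illustrative):

```python
import numpy as np

def in_isotonic_set(x, y, eps):
    """Check x ∈ R^n↓ε(y): whenever y_i < y_k, require x_i ≤ x_k − ε.

    Ties in y impose no constraint between the tied coordinates,
    matching the tie-handling convention of Definition 1.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    lt = y[:, None] < y[None, :]          # lt[i, k] = (y_i < y_k)
    ok = x[:, None] <= x[None, :] - eps   # ok[i, k] = (x_i ≤ x_k − ε)
    return bool(np.all(ok | ~lt))         # constrain only where y_i < y_k

# Items 0 and 1 are tied in y, and both are ranked below item 2
y = [1, 1, 3]
assert in_isotonic_set([0.0, 0.5, 1.5], y, eps=1.0)      # tie order is free
assert not in_isotonic_set([0.0, 0.5, 1.5], y, eps=2.0)  # margin too large
```

A DAG-specified partial ranking induces the same kind of set, with one inequality per edge instead of one per pair of distinct scores.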
Monotone Retargeting (MR) [2] exploits this observation to address the combinatorial listwise ranking problem [25] while maintaining the relative simplicity and scalability of pointwise estimates (regression). The key idea in MR is to alternately fit a pointwise algorithm to the current relevance scores, and retarget the scores by searching over the space of all monotonic transformations of the scores. Our approach extends and generalizes monotone retargeting for the preference prediction task.\n\nWe begin by motivating an algorithm for the noise free setting, where it is clear that Θ*_{Ω_j} ∈ R^{n_j}↓ε(y^{(j)}), so we seek to estimate a candidate preference matrix X that is in the intersection of (a) the data constraints from the observed preference rankings {X_{Ω_j} ∈ R^{n_j}↓ε(y^{(j)})}, and (b) the model constraints – in this case low rankness induced by constraining the nuclear norm ‖X‖*. For robust estimation in the presence of noise, we may extend the noise free approach by incorporating a soft penalty on constraint violations. Let z ∈ R^{|Ω|}, and with slight abuse of notation, let z_{Ω_j} ∈ R^{n_j} denote the vector of entries of z corresponding to Ω_j ⊂ Ω. Upon incorporating the soft penalties, the monotone retargeted low rank estimator is given by:\n\nX̂ = Argmin_X min_{z∈R^{|Ω|}} λ‖X‖* + (1/2)‖z − P_Ω(X)‖²₂ s.t. ∀j, z_{Ω_j} ∈ R^{n_j}↓ε(y^{(j)}),    (2)\n\nwhere the parameter λ controls the trade–off between nuclear norm regularization and data fit, and X̂ is the set of minimizers of (2). We note that R^n↓ε(y) is convex, and for all λ ≥ 1, the scaling λR^n↓ε(y) = {λx : x ∈ R^n↓ε(y)} ⊆ R^n↓ε(y). 
The above estimate can be computed using efficient convex optimization algorithms and can handle arbitrary monotonic transformations of the preference scores, thus providing greater flexibility than standard matrix completion.\n\nAlthough (2) is specified in terms of two parameters, due to the geometry of the problem it turns out that λ and ε are not jointly identifiable, as discussed in the following proposition.\n\nProposition 1 The optimization in (2) is jointly convex in (X, z). Further, ∀γ > 0, (λ, γε) and (γ⁻¹λ, ε) lead to equivalent estimators, specifically X̂(λ, γε) = γ⁻¹X̂(γ⁻¹λ, ε).\n\nSince positive scaling of X̂ preserves the resultant preference order, using Proposition 1, without loss of generality only one of ε or λ requires tuning, with the other remaining fixed.\n\n4 Optimization Algorithm\n\nThe optimization problem in (2) is jointly convex in (X, z). Further, we later show that the proximal operator of the non–differentiable component of the estimate, λ‖X‖* + Σ_j I(z_{Ω_j} ∈ R^{n_j}↓ε(y^{(j)})), is efficiently computable. This motivates using the proximal gradient descent algorithm [30] to jointly update (X, z). For an appropriate step size α = 1/2, the resulting updates are as follows:\n• X Update (Singular Value Thresholding): The proximal operator for τ‖·‖* is the singular value thresholding operator S_τ. 
For X with singular value decomposition X = UΣV⊤ and τ ≥ 0, S_τ(X) = U s_τ(Σ) V⊤, where s_τ is the soft thresholding operator given by s_τ(x)_i = max{x_i − τ, 0}.\n• z Update (Parallel Projections): For hard constraints on z, the proximal operator at v is the Euclidean projection onto the constraints, given by z ← argmin_z ‖z − v‖²₂ s.t. z_{Ω_j} ∈ R^{n_j}↓ε(y^{(j)}) ∀j ∈ [d2]. These updates decouple along each entity (column) z_{Ω_j} and can be trivially parallelized. Efficient projections onto R^{n_j}↓ε(y^{(j)}) are discussed in Section 4.1.\n\nAlgorithm 1 Proximal gradient descent for (2) with input Ω, {y^{(j)}}, ε and parameter λ\nfor k = 0, 1, 2, . . . , until (stopping criterion):\n    X^{(k+1)} = S_{λ/2}( X^{(k)} + (1/2) P*_Ω(z^{(k)} − X^{(k)}_Ω) ),    (3)\n    ∀j, z^{(k+1)}_{Ω_j} = Proj_{R^{n_j}↓ε(y^{(j)})}( (z^{(k)}_{Ω_j} + X^{(k)}_{Ω_j}) / 2 ).    (4)\n\n4.1 Projection onto R^n↓ε(y)\n\nWe begin with the following definitions that are used in characterizing R^n↓ε(y).\n\nDefinition 2 (Adjacent difference operator) The adjacent difference operator in R^n, denoted by D_n : R^n → R^{n−1}, is defined as (D_n x)_i = x_i − x_{i+1}, for i ∈ [n − 1].\n\nDefinition 3 (Incidence Matrix) For a directed graph G(V, E), the incidence matrix A_G ∈ R^{|V|×|E|} is such that: if the jth directed edge e_j ∈ E is from the ith node to the kth node, then (A_G)_{ij} = 1, (A_G)_{kj} = −1, and (A_G)_{lj} = 0 ∀l ≠ i, k.\n\nProjection onto R^n↓ε(y) is closely related to the isotonic regression problem of finding a univariate least squares fit under consistent order constraints (without margins). 
This isotonic regression problem in R^n can be solved exactly in O(n) time using the classical Pool of Adjacent Violators (PAV) algorithm [15; 4]:\n\nPAV(v) = argmin_{z′∈R^n} ‖z′ − v‖² s.t. z′_i − z′_{i+1} ≤ 0.    (5)\n\nAs we discuss, simple adaptations of isotonic regression can be used for projection onto ε-margin isotonic sets for the three special cases of interest, as summarized in Table 1.\n\n(A) Strict Total Order: y_{(1)} < y_{(2)} < . . . < y_{(n)}\nIn this setting, the constraint set can be characterized as R^n↓ε(y) = {x : D_n x ≤ −ε1}, where 1 is a vector of ones. For this case, projection onto R^n↓ε(y) differs from (5) only in requiring an ε–separation, and a straightforward extension of the PAV algorithm [4] can be used. Let d^sl ∈ R^n be any vector such that 1 = −D_n d^sl; then, by simple substitution, Proj_{R^n↓ε(y)}(x) = PAV(x − εd^sl) + εd^sl.\n\n(B) Blockwise Total Order: y_{(1)} ≤ y_{(2)} ≤ . . . ≤ y_{(n)}\nThis is a common setting for supervision in many preference completion applications, where listwise ranking preferences obtained from ratings over discrete quantized levels 1, 2, . . . , K, with K ≪ n, are prevalent. Let y be partitioned into K ≤ n blocks P = {P1, P2, . . . , PK}, such that the entries of y within each partition are equal and the blocks themselves are strictly ordered, i.e., ∀k ∈ [K], sup y(P_{k−1}) < inf y(P_k) = sup y(P_k) < inf y(P_{k+1}), where P_0 = P_{K+1} = ∅, and y(P) = {y_i : i ∈ P}.\n\n
Let \u03a0P be a set of valid permutations that permute entries only within\nLet dbl \u2208 Rn be such that dbl\nk=1 k Ii\u2208Pk is a vector of block indices dbl =\nsteps to compute(cid:98)z = ProjRn\u2193\u0001(y)(x) in this case:\nblocks {Pk \u2208 P}, then Rn\u2193\u0001(y) = {x :\u2203\u03c0\u2208 \u03a0P , Dn\u03c0(x) \u2264 \u2212\u0001Dndbl}. We propose the following\n\n(6)\n\nStep 1. \u03c0\u2217(x) s.t. \u2200k \u2208 [K], \u03c0\u2217(x)Pk = sort(xPk )\nStep 2.(cid:98)z = P AV (\u03c0\u2217(x) \u2212 \u0001dbl) + \u0001dbl.\n\nThe correctness of (6) is summarized by the following Lemma.\n\nLemma 2 Estimate(cid:98)z from (6) is the unique minimizer for\n\n(cid:107)z \u2212 x(cid:107)2\n\n2 s.t. \u2203\u03c0 \u2208 \u03a0P : Dn\u03c0(z) \u2264 \u2212\u0001Dndbl.\n\nargmin\n\nz\n\n(C) Arbitrary DAG: y = G([n], E)\nAn arbitrary DAG (not necessarily connected) can be used to represent any consistent order constraints\nover its vertices, e.g., partial rankings consolidated from multiple listwise/pairwise scores. In this\ncase, the \u0001\u2013margin isotonic set is given by Rn\u2193\u0001(y) = {x : A(cid:62)\nG x \u2264 \u2212\u00011} (c.f. De\ufb01nition 3).\nConsider dDAG \u2208 Rn such that ith entry dDAG\nis the length of the longest directed chain connecting\nthe topological descendants of the node i. It can be easily veri\ufb01ed that, the isotonic regression\nalgorithm for arbitrary DAGs applied on x \u2212 \u0001dDAG gives the projection onto Rn\u2193\u0001(y). In this most\ngeneral setting, the best isotonic regression algorithm for exact solution requires O(nm2 + n3 log n2)\ncomputation [34], where m is the number of edges in G. While even in the best case of m = o(n),\nthe computation can be prohibitive, we include this case for completeness. 
We also note that this case of partial DAG ordering cannot be handled in the standard matrix completion setting without consolidating the partial ranks into a total order.\n\n     R^n↓ε(y) | Proj_{R^n↓ε(y)}(x) | Computation\n(A) {x : D_n x ≤ −ε1} | PAV(x − εd^sl) + εd^sl | O(n)\n(B) {x : ∃π ∈ Π_P, D_n π(x) ≤ −εD_n d^bl} | π*⁻¹_P(PAV(π*_P(x) − εd^bl) + εd^bl) | O(n log n)\n(C) {x : A_G^⊤ x ≤ −ε1} | IsoReg(x − εd^DAG, G) + εd^DAG [34] | O(n²m + n³ log n)\n\nTable 1: Summary of algorithms for Proj_{R^n↓ε(y)}(x)\n\n4.2 Computational Complexity\n\nIt can be easily verified that the gradient of (1/2)‖P_Ω(X) − z‖²₂ is 2–Lipschitz continuous. Thus, from standard results on the convergence of proximal gradient descent [30], Algorithm 1 converges to within an ε error in objective in O(1/ε) iterations. Compared to proximal algorithms for standard matrix completion [5; 27], the additional complexity in Algorithm 1 arises in the z update (4), which is a simple substitution z^{(k)} = X^{(k)}_Ω in standard matrix completion. For total orders, the z update of (4) is highly efficient, and the overall algorithm is asymptotically within an additional log |Ω| factor of the computational costs of standard matrix completion.\n\n5 Generalization Error\n\nRecall that y_j are (noisy) partial rankings of subsets of items for each user, obtained from g_j(Θ*_j + W_j), where W is a noise matrix and the g_j are unknown and arbitrary transformations that only preserve the ranking order within each column. The estimator and the algorithms described so far are independent of the sampling distribution generating (Ω, {y_j}). 
In this section, we quantify simple generalization error bounds for (2).\n\nAssumption 1 (Sampling (P_Ω)) For a fixed W and Θ*, we assume the following sampling distribution. Let c0 be a fixed constant and R be a pre–specified parameter denoting the length of a single listwise observation. For s = 1, 2, . . . , |S| = c0 d2 log d2,\n\nj(s) ∼ uniform[d2], I(s) ∼ randsample([d1], R), Ω(s) = {(i, j(s)) : i ∈ I(s)}, y(s) = g_{j(s)}(P_{Ω(s)}(Θ* + W)).    (7)\n\nFurther, we define the notation: ∀j, I_j = ∪_{s:j(s)=j} I(s), Ω_j = ∪_{s:j(s)=j} Ω(s), and n_j = |Ω_j|.\n\nFor each column j, the listwise scores {y(s) : j(s) = j} jointly define a consistent partial ranking of I_j, as the scores are subsets of a monotonically transformed preference vector g_j(Θ*_j + W_j). This consistent ordering is represented by a DAG y^{(j)} = PartialOrder({y(s) : j(s) = j}). We also note that O(d2 log d2) samples ensure that each column is included in the sampling with high probability.\n\nDefinition 4 (Projection Loss) Let y = G([n], E) or y ∈ R^n define a partial or total order in R^n, respectively. We define the following convex surrogate loss over partial rankings:\n\nΦ(x, y) = min_{z∈R^n↓ε(y)} ‖x − z‖₂.\n\nTheorem 3 (Generalization Bound) Let X̂ be an estimate from (2). 
With appropriate scaling, let ‖X̂‖F = 1; then, for constants K1, K2, the following holds with probability greater than 1 − δ over all observed rankings {y^{(j)}, Ω_j : j ∈ [d2]} drawn from (7) with |S| ≥ c0 d2 log d2:\n\nE_{y(s),Ω(s)} Φ(X̂_{Ω(s)}, y(s)) ≤ (1/|S|) Σ_{s=1}^{|S|} Φ(X̂_{Ω(s)}, y(s)) + K1 (‖X̂‖*/√(d1 d2)) log^{1/4} d √(d log d/(R|S|)) + K2 √(log(2/δ)/|S|).\n\nTheorem 3 quantifies the test projection loss over a random length-R set of items I(s) drawn for a random entity/user j(s). The bound provides a trade–off between the observable training error and a complexity term defined by the nuclear norm of the estimate.\n\n6 Experiments\n\nWe evaluate our model on two collaborative preference estimation tasks: (a) a standard user–item recommendation task on a benchmarked dataset from Movielens, and (b) identifying associations between brain–regions and cognitive terms using the neurosynth dataset [37].\n\nBaselines: The following baseline models are compared in our experiments:\n• Retargeted Matrix Completion (RMC): the estimator proposed in (2).\n• Standard Matrix Completion (SMC) [8]: We primarily compare our estimator with the standard convex estimator for matrix completion using nuclear norm minimization.\n• Collaborative Filtering Ranking (CoFi-Rank) [35]: This work addresses the collaborative filtering task in a listwise ranking setting.\n\nFor SMC and RMC, the hyperparameters were tuned using grid search on a logarithmic scale. Due to the high computational cost of tuning parameters in CoFi-Rank, we use the code and default parameters provided by the authors.\n\nEvaluation metrics: The performance on preference estimation tasks is evaluated on four ranking metrics: (a) Normalized Discounted Cumulative Gain (NDCG@N), (b) Precision@N, (c) Spearman Rho, and (d) 
Kendall Tau, where the latter two metrics measure the correlation over the complete ordering of the list, while the former two primarily evaluate the correctness of the ranking at the top of the list (see Liu et al. [25] for further details on these metrics).\n\nMovielens dataset (blockwise total order): Movielens is a movie recommendation website administered by GroupLens Research. We used the widely benchmarked Movielens 100K dataset, with the 5–fold train/test splits provided with the dataset (the test splits are non-overlapping). We discarded a small number of users that had fewer than 10 ratings in any of the 5 training data splits. The resultant dataset consists of 923 users and 1682 items. The ratings are blockwise ordered – taking one of 5 values in the set {1, 2, . . . , 5}. During testing, for each user, the competing models return a ranking of the test items, and the performance is averaged across test users. Table 2 presents the results of our evaluation averaged across the 5 train/test splits on the Movielens dataset, along with the standard deviation. We see that the proposed retargeted matrix completion (RMC) significantly and consistently outperforms SMC and CoFi-Rank [35] across ranking metrics.\n\n          | NDCG@5 | Precision@5 | Spearman Rho | Kendall Tau\nRMC       | 0.7984 (0.0213) | 0.7546 (0.0320) | 0.4137 (0.0099) | 0.3383 (0.0117)\nSMC       | 0.7863 (0.0243) | 0.7429 (0.0295) | 0.3722 (0.0106) | 0.3031 (0.0117)\nCoFi-Rank | 0.7731 (0.0213) | 0.7314 (0.0293) | 0.3681 (0.0082) | 0.2993 (0.0110)\n\nTable 2: Ranking performance for recommendations on Movielens 100K. The table shows the mean and standard deviation over 5-fold train/test splits. For all reported metrics, higher values are better [25].\n\nNeurosynth dataset (almost total order): Neurosynth [37] is a publicly available database consisting of data automatically extracted from a large collection of functional magnetic resonance imaging (fMRI) publications (11,362 publications in the current version). 
For each publication, the database contains the abstract text and all reported 3-dimensional peak activation coordinates in the study. The text is pre-processed to remove common stop words and any term with less than 0.1% frequency, leaving a total of 3169 terms. We applied the standard brain map to the activations, removing voxels outside of the grey matter. Next, the activations were downsampled from 2mm³ voxels to 10mm³ voxels using the nilearn python package, resulting in a total of 1231 dense voxels. The affinity measure between the 3169 terms and 1231 consolidated voxels is obtained by multiplying the term × publication and the publication × voxel matrices. The resulting data is a dense, high-rank preference matrix. With very few tied preference values, this setting best fits the case of totally ordered observations (case A in Section 4.1). Using this data, we consider the reverse inference task of ranking cognitive concepts (terms) for each brain region (voxel) [37].
Train-val-test: We used 10% of randomly sampled entries of the matrix as test data and another 10% for validation. We created training datasets of various sample sizes by subsampling from the remaining 80% of the data. This random split was replicated multiple times to obtain 3 bootstrapped data splits (note that, unlike cross validation, the test datasets here can have some overlapping entries). The results in Fig. 1 show that the proposed estimate from (2) outperforms standard matrix completion in terms of popular ranking metrics.

Figure 1: Ranking performance for reverse inference on Neurosynth data. The x-axis denotes the fraction of the affinity matrix entries used as observations in training. Plots show means with error bars for standard deviation over 3 bootstrapped train/test splits.
For all reported ranking metrics, higher values are better [25].

7 Conclusion

Our work addresses the problem of collaborative ranking, a task of growing importance to modern problems in recommender systems, large-scale meta-analysis, and related areas. We proposed a novel convex estimator for collaborative LETOR from sparsely observed preferences, where the observations can be either score vectors representing total orders or, more generally, directed acyclic graphs representing partial orders. Remarkably, in the case of complete orders, the complexity of our algorithm is within a log factor of the state-of-the-art algorithms for standard matrix completion. Our estimator was empirically evaluated in experiments on real data.

Acknowledgments: SG and JG acknowledge funding from NSF grants IIS-1421729 and SCH 1418511.

References
[1] S. Acharyya and J. Ghosh. MEMR: A margin equipped monotone retargeting framework for ranking. In UAI, 2013.
[2] S. Acharyya, O. Koyejo, and J. Ghosh. Learning to rank with Bregman divergences and monotone retargeting. In UAI, 2012.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2003.
[4] M. J. Best and N. Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Math. Program., 1990.
[5] J. F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 2010.
[6] E. J. Candès and Y. Plan. Matrix completion with noise. Proc. IEEE, 2010.
[7] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. FoCM, 2009.
[8] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 2006.
[9] Z. Cao, T. Qin, T. Y. Liu, M. F. Tsai, and H. Li.
Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
[10] E. Chi, H. Zhou, G. Chen, D. O. Del Vecchyo, and K. Lange. Genotype imputation via matrix completion. Genome Res., 2013.
[11] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In RecSys, 2010.
[12] J. C. Duchi, L. W. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In ICML, 2010.
[13] R. Ganti, L. Balzano, and R. Willett. Matrix completion under monotonic single index models. In NIPS, 2015.
[14] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Commun. ACM, 1992.
[15] S. J. Grotzinger and C. Witzgall. Projections onto order simplexes. Appl. Math. Optim., 1984.
[16] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In NIPS, 1999.
[17] J. L. Horowitz. Semiparametric and nonparametric methods in econometrics, volume 12. Springer, 2009.
[18] T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD, 2002.
[19] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, 2011.
[20] A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
[21] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. IT, 2010.
[22] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 2009.
[23] O. Koyejo, S. Acharyya, and J. Ghosh. Retargeted matrix factorization for collaborative filtering. In RecSys, 2013.
[24] P. Li, Q. Wu, and C. J. Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
[25] T. Y. Liu. Learning to rank for information retrieval. Foundations and Trends in IR, 2009.
[26] Y. Lu and S. N. Negahban. Individualized rank aggregation using nuclear norm regularization. In Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2015.
[27] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program., 2011.
[28] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In NIPS, 2007.
[29] S. Oh, K. K. Thekumparampil, and J. Xu. Collaboratively learning preferences from ordinal data. In NIPS, 2015.
[30] N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 2014.
[31] D. Park, J. Neeman, J. Zhang, S. Sanghavi, and I. Dhillon. Preference completion: Large-scale collaborative ranking from pairwise comparisons. In ICML, 2015.
[32] B. Recht. A simpler approach to matrix completion. JMLR, 2011.
[33] H. Steck. Training and testing of recommender systems on data missing not at random. In KDD, 2010.
[34] Q. F. Stout. Isotonic regression via partitioning. Algorithmica, 2013.
[35] M. Weimer, A. Karatzoglou, Q. V. Le, and A. J. Smola. COFIRANK - maximum margin matrix factorization for collaborative ranking. In NIPS, 2008.
[36] T. Yarkoni, R. A. Poldrack, T. E. Nichols, D. C. Van Essen, and T. D. Wager. Large-scale automated synthesis of human functional neuroimaging data. Nat. Methods, 2011.
[37] T. Yarkoni. Neurosynth. http://neurosynth.org/, 2011.
[38] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Algorithmic Aspects in Information and Management, LNCS 5034, 2008.