{"title": "Finite Sample Prediction and Recovery Bounds for Ordinal Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 2711, "page_last": 2719, "abstract": "The goal of ordinal embedding is to represent items as points in a low-dimensional Euclidean space given a set of constraints like ``item $i$ is closer to item $j$ than item $k$''. Ordinal constraints like this often come from human judgments. The classic approach to solving this problem is known as non-metric multidimensional scaling. To account for errors and variation in judgments, we consider the noisy situation in which the given constraints are independently corrupted by reversing the correct constraint with some probability. The ordinal embedding problem has been studied for decades, but most past work pays little attention to the question of whether accurate embedding is possible, apart from empirical studies. This paper shows that under a generative data model it is possible to learn the correct embedding from noisy distance comparisons. In establishing this fundamental result, the paper makes several new contributions. First, we derive prediction error bounds for embedding from noisy distance comparisons by exploiting the fact that the rank of a distance matrix of points in $\\R^d$ is at most $d+2$. These bounds characterize how well a learned embedding predicts new comparative judgments. Second, we show that the underlying embedding can be recovered by solving a simple convex optimization. This result is highly non-trivial since we show that the linear map corresponding to distance comparisons is non-invertible, but there exists a nonlinear map that is invertible. 
Third, two new algorithms for ordinal embedding are proposed and evaluated in experiments.", "full_text": "Finite Sample Prediction and Recovery Bounds\n\nfor Ordinal Embedding\n\nLalit Jain\n\nUniversity of Michigan\nAnn Arbor, MI 48109\nlalitj@umich.edu\n\nKevin Jamieson\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nkjamieson@berkeley.edu\n\nRobert Nowak\n\nUniversity of Wisconsin\n\nMadison, WI 53706\nrdnowak@wisc.edu\n\nAbstract\n\nThe goal of ordinal embedding is to represent items as points in a low-dimensional\nEuclidean space given a set of constraints like \u201citem i is closer to item j than\nitem k\u201d. Ordinal constraints like this often come from human judgments. The\nclassic approach to solving this problem is known as non-metric multidimensional\nscaling. To account for errors and variation in judgments, we consider the noisy\nsituation in which the given constraints are independently corrupted by reversing\nthe correct constraint with some probability. The ordinal embedding problem has\nbeen studied for decades, but most past work pays little attention to the question\nof whether accurate embedding is possible, apart from empirical studies. This\npaper shows that under a generative data model it is possible to learn the correct\nembedding from noisy distance comparisons. In establishing this fundamental\nresult, the paper makes several new contributions. First, we derive prediction error\nbounds for embedding from noisy distance comparisons by exploiting the fact\nthat the rank of a distance matrix of points in Rd is at most d + 2. These bounds\ncharacterize how well a learned embedding predicts new comparative judgments.\nSecond, we show that the underlying embedding can be recovered by solving a\nsimple convex optimization. This result is highly non-trivial since we show that\nthe linear map corresponding to distance comparisons is non-invertible, but there\nexists a nonlinear map that is invertible. 
Third, two new algorithms for ordinal\nembedding are proposed and evaluated in experiments.\n\n1 Ordinal Embedding\n\nOrdinal embedding aims to represent items as points in Rd so that the distances between items agree\nas well as possible with a given set of ordinal comparisons such as item i is closer to item j than\nto item k. In other words, the goal is to \ufb01nd a geometric representation of data that is faithful to\ncomparative similarity judgments. This problem has been studied and applied for more than 50 years,\ndating back to the classic non-metric multidimensional scaling (NMDS) [1, 2] approach, and it is\nwidely used to gauge and visualize how people perceive similarities.\nDespite the widespread application of NMDS and recent algorithmic developments [3, 4, 5, 6, 7],\nthe fundamental question of whether an embedding can be learned from noisy distance/similarity\ncomparisons had not been answered. This paper shows that if the data are generated according to\na known probabilistic model, then accurate recovery of the underlying embedding is possible by\nsolving a simple convex optimization, settling this long-standing open question. In the process of\nanswering this question, the paper also characterizes how well a learned embedding predicts new\ndistance comparisons and presents two new computationally ef\ufb01cient algorithms for solving the\noptimization problem.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f1.1 Related Work\nThe classic approach to ordinal embedding is NMDS [1, 2]. Recently, several authors have proposed\nnew approaches based on more modern techniques. Generalized NMDS [3] and Stochastic Triplet\nEmbedding (STE) [6] employ hinge or logistic loss measures and convex relaxations of the low-\ndimensionality (i.e., rank) constraint based on the nuclear norm. These works are most closely related\nto the theory and methods in this paper. 
The Linear partial order embedding (LPOE) method is similar, but starts with a known Euclidean embedding and learns a kernel/metric in this space based on distance comparison data [7]. The Crowd Kernel [4] and t-STE [6] propose alternative non-convex loss measures based on probabilistic generative models. The main contributions in these papers are new optimization methods and experimental studies, but they did not address the fundamental question of whether an embedding can be recovered under an assumed generative model. Other recent work has looked at the asymptotics of ordinal embedding, showing that embeddings can be learned as the number of items grows and the items densely populate the embedding space [8, 9, 10]. In contrast, this paper focuses on the practical setting involving a finite set of items. Finally, it is known that at least 2dn log n distance comparisons are necessary to learn an embedding of n points in R^d [5].\n\n1.2 Ordinal Embedding from Noisy Data\nConsider n points x_1, x_2, . . . , x_n ∈ R^d. Let X = [x_1 ··· x_n] ∈ R^{d×n}. The Euclidean distance matrix D* is defined to have elements D*_{ij} = ||x_i − x_j||^2. Ordinal embedding is the problem of recovering X given ordinal constraints on distances. This paper focuses on \u201ctriplet\u201d constraints of the form D*_{ij} < D*_{ik}, where 1 ≤ i ≠ j ≠ k ≤ n. Furthermore, we only observe noisy indications of these constraints, as follows. Each triplet t = (i, j, k) has an associated probability p_t satisfying\n\np_t > 1/2 ⟺ ||x_i − x_j||^2 < ||x_i − x_k||^2.\n\nLet S denote a collection of triplets drawn independently and uniformly at random. And for each t ∈ S we observe an independent random variable y_t = −1 with probability p_t, and y_t = 1 otherwise. The goal is to recover the embedding X from these data. Exact recovery of D* from such data requires a known link between p_t and D*. 
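As a concrete illustration of the observation model just described, the following sketch draws triplets uniformly at random and flips each comparison according to a logistic link (NumPy; all names here are illustrative, not from the paper's code):

```python
import numpy as np

def sample_triplets(X, num_triplets, rng):
    # X: d x n matrix of points. Returns a list of ((i, j, k), y_t) with
    # y_t = -1 with probability f(D_ij - D_ik) under the logistic link f.
    d, n = X.shape
    # squared Euclidean distance matrix D[a, b] = ||x_a - x_b||^2
    D = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    out = []
    for _ in range(num_triplets):
        i, j, k = rng.choice(n, size=3, replace=False)
        p = 1.0 / (1.0 + np.exp(D[i, j] - D[i, k]))  # P(y_t = -1)
        y = -1 if rng.random() < p else 1
        out.append(((i, j, k), y))
    return out
```

When item j is much closer to item i than item k is, p is near 1 and the observed label is almost always −1; when the two distances are comparable, the label is nearly a coin flip.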
To this end, our main focus is the following problem.\n\nOrdinal Embedding from Noisy Data\nConsider n points x_1, x_2, ..., x_n in d-dimensional Euclidean space. Let S denote a collection of triplets and for each t ∈ S observe an independent random variable\n\ny_t = −1 with probability f(D*_{ij} − D*_{ik}), and y_t = 1 with probability 1 − f(D*_{ij} − D*_{ik}),\n\nwhere the link function f : R → [0, 1] is known. Estimate X from S, {y_t}, and f.\nFor example, if f is the logistic function, then for triplet t = (i, j, k)\n\np_t = P(y_t = −1) = f(D*_{ij} − D*_{ik}) = 1 / (1 + exp(D*_{ij} − D*_{ik})),   (1)\n\nand then D*_{ij} − D*_{ik} = log((1 − p_t)/p_t). However, we stress that we only require the existence of a link function for exact recovery of D*. Indeed, if one just wishes to predict the answers to unobserved triplets, then the results of Section 2 hold for arbitrary p_t probabilities. Aspects of the statistical analysis are related to one-bit matrix completion and rank aggregation [11, 12, 13]. However, we use novel methods for the recovery of the embedding based on geometric properties of Euclidean distance matrices.\n\n1.3 Organization of Paper\nThis paper takes the following approach to ordinal embedding.\n\n1. Our samples are assumed to be independently generated according to a probabilistic model based on an underlying low-rank distance matrix. We use relatively standard statistical learning theory techniques to analyze the minimizer of a bounded, Lipschitz loss with a nuclear norm constraint, and show that an embedding can be learned from the data that predicts nearly as well as the true embedding with O(dn log n) samples (Theorem 1).\n2. Next, assuming the form of the probabilistic generative model is known (e.g., logistic), we show that if the learned embedding is a good predictor of the ordinal comparisons, then it must also be a good estimator of the true differences of distances between the embedding points (Theorem 2). 
This result hinges on the fact that the (linear) observation model acts approximately like an isometry on differences of distances.\n3. While the true differences of distances can be estimated, the observation process is \u201cblind\u201d to the mean distance between embedding points. Despite this, we show that the mean is determined by the differences of distances, due to the special properties of Euclidean distance matrices. Specifically, the second eigenvalue of the \u201cmean-centered\u201d distance matrix (well-estimated by the data from the estimate of the differences of distances, Theorem 3) is proportional to the mean distance (Theorem 4). This allows us to show that the minimizer of the loss with a nuclear norm constraint indeed recovers an accurate estimate of the underlying true distance matrix.\n\n1.4 Notation and Assumptions\nWe will use (D*, G*) to denote the distance and Gram matrices of the latent embedding, and (D, G) to denote an arbitrary distance matrix and its corresponding Gram matrix. The observations {y_t} carry information about D*, but distance matrices are invariant to rotation and translation, and therefore it may only be possible to recover X up to a rigid transformation. Without loss of generality, we assume the points x_1, ..., x_n ∈ R^d are centered at the origin (i.e., sum_{i=1}^n x_i = 0).\nDefine the centering matrix V := I − (1/n) 1 1^T. If X is centered, XV = X. Note that D* is determined by the Gram matrix G* = X^T X. In addition, X can be determined from G* up to a unitary transformation. Note that if X is centered, the Gram matrix is \u201ccentered\u201d so that V G* V = G*. It will be convenient in the paper to work with both the distance and Gram matrix representations, and the following identities will be useful to keep in mind. 
For any distance matrix D and its centered Gram matrix G,\n\nG = −(1/2) V D V,   (2)\nD = diag(G) 1^T − 2G + 1 diag(G)^T,   (3)\n\nwhere diag(G) is the column vector composed of the diagonal of G. In particular this establishes a bijection between centered Gram matrices and distance matrices. We refer the reader to [14] for an insightful and thorough treatment of the properties of distance matrices. We also define the set of all unique triplets\n\nT := {(i, j, k) : 1 ≤ i ≠ j ≠ k ≤ n, j < k}.\n\nAssumption 1. The observed triplets in S are drawn independently and uniformly from T.\n\n2 Prediction Error Bounds\nFor t ∈ T with t = (i, j, k) we define L_t to be the linear operator satisfying L_t(X^T X) = ||x_i − x_j||^2 − ||x_i − x_k||^2 for all t ∈ T. In general, for any Gram matrix G,\n\nL_t(G) := G_{jj} − 2G_{ij} − G_{kk} + 2G_{ik}.\n\nWe can naturally view L_t as a linear operator on S^n_+, the space of n×n symmetric positive semidefinite matrices. We can also represent L_t as a symmetric n×n matrix that is zero everywhere except on the submatrix corresponding to i, j, k, which (with rows and columns ordered i, j, k) has the form\n\n[ 0 −1 1 ]\n[ −1 1 0 ]\n[ 1 0 −1 ]\n\nand so we will write L_t(G) := ⟨L_t, G⟩, where ⟨A, B⟩ = vec(A)^T vec(B) for any compatible matrices A, B. Ordering the elements of T lexicographically, we arrange all the L_t(G) together to define the n(n−1)(n−2)/2-dimensional vector\n\nL(G) = [L_{123}(G), L_{124}(G), ..., L_{ijk}(G), ...]^T.   (4)\n\nLet ℓ(y_t⟨L_t, G⟩) denote a loss function. 
For example we can consider the 0−1 loss ℓ(y_t⟨L_t, G⟩) = 1{sign(y_t⟨L_t, G⟩) ≠ 1}, the hinge loss ℓ(y_t⟨L_t, G⟩) = max{0, 1 − y_t⟨L_t, G⟩}, or the logistic loss\n\nℓ(y_t⟨L_t, G⟩) = log(1 + exp(−y_t⟨L_t, G⟩)).   (5)\n\nLet p_t := P(y_t = −1) and take the expectation of the loss with respect to both the uniformly random selection of the triplet t and the observation y_t; we have the risk of G\n\nR(G) := E[ℓ(y_t⟨L_t, G⟩)] = (1/|T|) sum_{t ∈ T} [ p_t ℓ(−⟨L_t, G⟩) + (1 − p_t) ℓ(⟨L_t, G⟩) ].\n\nGiven a set of observations S under the model defined in the problem statement, the empirical risk is\n\nR̂_S(G) = (1/|S|) sum_{t ∈ S} ℓ(y_t⟨L_t, G⟩),   (6)\n\nwhich is an unbiased estimator of the true risk: E[R̂_S(G)] = R(G). For any G ∈ S^n_+, let ||G||_* denote the nuclear norm and ||G||_∞ := max_{ij} |G_{ij}|. Define the constraint set\n\nG_{λ,γ} := {G ∈ S^n_+ : ||G||_* ≤ λ, ||G||_∞ ≤ γ}.   (7)\n\nWe estimate G* by Ĝ, the solution of the program\n\nĜ := argmin_{G ∈ G_{λ,γ}} R̂_S(G).   (8)\n\nSince G* is positive semidefinite, we expect the diagonal entries of G* to bound the off-diagonal entries. So an infinity norm constraint on the diagonal guarantees that the points x_1, ..., x_n corresponding to G* live inside a bounded ℓ2 ball. The ℓ∞ constraint in (7) plays two roles: 1) if our loss function is Lipschitz, large magnitude values of ⟨L_t, G⟩ can lead to large deviations of R̂_S(G) from R(G); bounding ||G||_∞ bounds |⟨L_t, G⟩|. 2) Later we will define ℓ in terms of the link function f, and as the magnitude of ⟨L_t, G⟩ increases the magnitude of the derivative of the link function f typically becomes very small, making it difficult to \u201cinvert\u201d; bounding ||G||_∞ tends to keep ⟨L_t, G⟩ within an invertible regime of f.\nTheorem 1. Fix λ, γ and assume G* ∈ G_{λ,γ}. If the loss function ℓ(·) is L-Lipschitz (or |sup_y ℓ(y)| ≤ L max{1, 12γ}) then with probability at least 1 − δ,\n\nR(Ĝ) − R(G*) ≤ (4Lλ/|S|) ( sqrt(18|S| log(n)/n) + (sqrt(3)/3) log n ) + Lγ sqrt(288 log(2/δ)/|S|).\n\nProof. 
The proof follows from standard statistical learning theory techniques, see for instance [15]. By the bounded difference inequality, with probability 1 − δ,\n\nR(Ĝ) − R(G*) = R(Ĝ) − R̂_S(Ĝ) + R̂_S(Ĝ) − R̂_S(G*) + R̂_S(G*) − R(G*)\n≤ 2 sup_{G ∈ G_{λ,γ}} |R̂_S(G) − R(G)| ≤ 2 E[ sup_{G ∈ G_{λ,γ}} |R̂_S(G) − R(G)| ] + sqrt(2B^2 log(2/δ)/|S|),\n\nwhere sup_{G ∈ G_{λ,γ}} ℓ(y_t⟨L_t, G⟩) − ℓ(y_{t'}⟨L_{t'}, G⟩) ≤ sup_{G ∈ G_{λ,γ}} L |⟨y_t L_t − y_{t'} L_{t'}, G⟩| ≤ 12Lγ =: B, using the facts that L_t has 6 non-zeros of magnitude 1 and ||G||_∞ ≤ γ.\nUsing standard symmetrization and contraction lemmas, we can introduce Rademacher random variables ε_t ∈ {−1, 1} for all t ∈ S so that\n\nE sup_{G ∈ G_{λ,γ}} |R̂_S(G) − R(G)| ≤ E sup_{G ∈ G_{λ,γ}} (2L/|S|) sum_{t ∈ S} ε_t ⟨L_t, G⟩.\n\nThe right hand side is just the Rademacher complexity of G_{λ,γ}. By definition,\n\n{G : ||G||_* ≤ λ} = λ · conv({u u^T : |u| = 1}),\n\nwhere conv(U) is the convex hull of a set U. Since the Rademacher complexity of a set is the same as the Rademacher complexity of its closed convex hull,\n\nE sup_{G ∈ G_{λ,γ}} sum_{t ∈ S} ε_t ⟨L_t, G⟩ ≤ E sup_{|u|=1} λ sum_{t ∈ S} ε_t ⟨L_t, u u^T⟩ = E sup_{|u|=1} λ u^T ( sum_{t ∈ S} ε_t L_t ) u,\n\nwhich we recognize is just λ E|| sum_{t ∈ S} ε_t L_t ||. By [16, 6.6.1] we can bound the operator norm || sum_{t ∈ S} ε_t L_t || in terms of the variance of sum_{t ∈ S} L_t^2 and the maximal eigenvalue of max_t L_t. These are computed in Lemma 1 given in the supplemental materials. Combining these results gives\n\nE|| sum_{t ∈ S} ε_t L_t || ≤ sqrt(18|S| log(n)/n) + (sqrt(3)/3) log n.\n\nWe remark that if G is a rank d < n matrix then\n\n||G||_* ≤ sqrt(d) ||G||_F ≤ sqrt(d) n ||G||_∞,\n\nso if G* is low rank, we really only need a bound on the infinity norm of our constraint set. Under the assumption that G* is rank d with ||G*||_∞ ≤ γ, if we set λ = γ n sqrt(d), then Theorem 1 implies that for |S| > n log n/161,\n\nR(Ĝ) − R(G*) ≤ 8Lγ sqrt(18 d n log(n)/|S|) + Lγ sqrt(288 log(2/δ)/|S|)\n\nwith probability at least 1 − δ. 
The above display says that |S| must scale like dn log(n), which is consistent with known finite sample bounds [5].\n\n3 Maximum Likelihood Embedding\nWe now turn our attention to recovering metric information about G*. Let S be a collection of triplets sampled uniformly at random with replacement and let f : R → (0, 1) be a known probability function governing the observations. Any link function f induces a natural loss function ℓ_f, namely the negative log-likelihood of a solution G given an observation y_t, defined as\n\nℓ_f(y_t⟨L_t, G⟩) = 1{y_t = −1} log(1/f(⟨L_t, G⟩)) + 1{y_t = 1} log(1/(1 − f(⟨L_t, G⟩))).\n\nFor example, the logistic link function of (1) induces the logistic loss of (5). Recalling that P(y_t = −1) = f(⟨L_t, G*⟩), we have\n\nE[ℓ_f(y_t⟨L_t, G⟩)] = f(⟨L_t, G*⟩) log(1/f(⟨L_t, G⟩)) + (1 − f(⟨L_t, G*⟩)) log(1/(1 − f(⟨L_t, G⟩)))\n= H(f(⟨L_t, G*⟩)) + KL(f(⟨L_t, G*⟩) | f(⟨L_t, G⟩)),\n\nwhere H(p) = p log(1/p) + (1 − p) log(1/(1 − p)) and KL(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) are the entropy and KL divergence of Bernoulli RVs with means p, q. Recall that ||G||_∞ ≤ γ controls the magnitude of ⟨L_t, G⟩, so for the moment assume this is small. Then by a Taylor series f(⟨L_t, G⟩) ≈ 1/2 + f'(0)⟨L_t, G⟩, using the fact that f(0) = 1/2, and by another Taylor series we have\n\nKL(f(⟨L_t, G*⟩) | f(⟨L_t, G⟩)) ≈ KL(1/2 + f'(0)⟨L_t, G*⟩ | 1/2 + f'(0)⟨L_t, G⟩) ≈ 2 f'(0)^2 (⟨L_t, G* − G⟩)^2.\n\nThus, recalling the definition of L(G) from (4), we conclude that if G̃ ∈ argmin_G R(G) with R(G) = (1/|T|) sum_{t ∈ T} E[ℓ_f(y_t⟨L_t, G⟩)], then one would expect L(G̃) ≈ L(G*). Moreover, since R̂_S(G) is an unbiased estimator of R(G), one expects L(Ĝ) to approximate L(G*). The next theorem, combined with Theorem 1, formalizes this observation; its proof is found in the appendix.\n\nTheorem 2. Let C_f = min_{t ∈ T} inf_{G ∈ G_{λ,γ}} |f'(⟨L_t, G⟩)|, where f' denotes the derivative of f. 
Then for any G,\n\n(2 C_f^2/|T|) ||L(G) − L(G*)||_2^2 ≤ R(G) − R(G*).\n\nNote that if f is the logistic link function of (1) then it is straightforward to show that |f'(⟨L_t, G⟩)| ≥ (1/4) exp(−|⟨L_t, G⟩|) ≥ (1/4) exp(−6||G||_∞) for any t, G, so it suffices to take C_f = (1/4) exp(−6γ).\nIt remains to see that we can recover G* even given L(G*), much less L(Ĝ). To do this, it is more convenient to work with distance matrices instead of Gram matrices. Analogous to the operators L_t(G) defined above, we define the operators Δ_t for t ∈ T satisfying\n\nΔ_t(D) := D_{ij} − D_{ik} ≡ L_t(G).\n\nWe will view the Δ_t as linear operators on the space of symmetric hollow n×n matrices S^n_h, which includes distance matrices as special cases. As with L, we can arrange all the Δ_t together, ordering the t ∈ T lexicographically, to define the n(n−1)(n−2)/2-dimensional vector\n\nΔ(D) = [D_{12} − D_{13}, ..., D_{ij} − D_{ik}, ...]^T.\n\nWe will use the fact that L(G) ≡ Δ(D) heavily. Because Δ(D) consists of differences of matrix entries, Δ has a non-trivial kernel. However, it is easy to see that D can be recovered given Δ(D) and any one off-diagonal element of D, so the kernel is 1-dimensional. Also, the kernel is easy to identify by example. Consider the regular simplex in d dimensions. The distances between all n = d + 1 vertices are equal and the distance matrix can easily be seen to be 11^T − I. Thus Δ(D) = 0 in this case. This gives us the following simple result.\nLemma 2. Let S^n_h denote the space of symmetric hollow matrices, which includes all distance matrices. For any D ∈ S^n_h, the set of linear functionals {Δ_t(D), t ∈ T} spans an (n(n−1)/2 − 1)-dimensional subspace of S^n_h, and the 1-dimensional kernel is given by the span of 11^T − I.\nSo we see that the operator Δ is not invertible on S^n_h. Define J := 11^T − I. 
For any D, let C, the centered distance matrix, be the component of D orthogonal to the kernel of Δ (i.e., tr(CJ) = 0). Then we have the orthogonal decomposition\n\nD = C + D̄ J,\n\nwhere D̄ = trace(DJ)/||J||_F^2. Since G is assumed to be centered, the value of D̄ has a simple interpretation:\n\nD̄ = (2/(n(n−1))) sum_{1 ≤ i < j ≤ n} D_{ij} = (2/(n−1)) sum_{1 ≤ i ≤ n} ⟨x_i, x_i⟩ = 2||G||_*/(n−1),   (9)\n\nthe average of the squared distances or alternatively a scaled version of the nuclear norm of G.\nLet D̂ and Ĉ be the distance and centered distance matrices corresponding to Ĝ, the solution to (8). Though Δ is not invertible on all of S^n_h, it is invertible on the subspace orthogonal to the kernel, namely J^⊥. So if Δ(D̂) ≈ Δ(D*), or equivalently L(Ĝ) ≈ L(G*), we expect Ĉ to be close to C*. The next theorem quantifies this.\nTheorem 3. Consider the setting of Theorems 1 and 2 and let Ĉ, C* be defined as above. Then\n\n(1/(2n^2)) ||Ĉ − C*||_F^2 ≤ (Lλ/(C_f^2 |S|)) ( sqrt(18|S| log(n)/n) + (sqrt(3)/3) log n ) + (Lγ/(4 C_f^2)) sqrt(288 log(2/δ)/|S|).\n\nProof. By combining Theorem 2 with the prediction error bounds obtained in Theorem 1 we see that\n\n(2 C_f^2/|T|) ||L(Ĝ) − L(G*)||_2^2 ≤ R(Ĝ) − R(G*) ≤ (4Lλ/|S|) ( sqrt(18|S| log(n)/n) + (sqrt(3)/3) log n ) + Lγ sqrt(288 log(2/δ)/|S|).\n\nNext we employ the following restricted isometry property of Δ on the subspace J^⊥, whose proof is in the supplementary materials.\nLemma 3. Let D and D' be two different distance matrices of n points in R^d and R^{d'}. Let C and C' be the components of D and D' orthogonal to J. Then\n\nn ||C − C'||_F^2 ≤ ||Δ(C) − Δ(C')||^2 = ||Δ(D) − Δ(D')||^2 ≤ 2(n−1) ||C − C'||_F^2.\n\nThe result then follows since |T| ≤ n^3/2.\nThis implies that by collecting enough samples, we can recover the centered distance matrix. 
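The orthogonal decomposition D = C + D̄J and the identity (9) are easy to verify numerically, along with the Gram/distance identities (2) and (3); a short sketch (NumPy, illustrative only):

```python
import numpy as np

n, d = 8, 3
rng = np.random.default_rng(2)
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)            # center the points
G = X.T @ X                                      # centered Gram matrix
D = np.add.outer(np.diag(G), np.diag(G)) - 2 * G # identity (3)
V = np.eye(n) - np.ones((n, n)) / n              # centering matrix
assert np.allclose(G, -0.5 * V @ D @ V)          # identity (2)

J = np.ones((n, n)) - np.eye(n)                  # kernel direction 11^T - I
Dbar = np.trace(D @ J) / np.linalg.norm(J) ** 2  # mean squared distance
C = D - Dbar * J                                 # centered distance matrix
assert np.isclose(np.trace(C @ J), 0.0)          # C is orthogonal to J
# identity (9): Dbar = 2 ||G||_* / (n - 1) for a centered PSD Gram matrix
assert np.isclose(Dbar, 2 * np.trace(G) / (n - 1))
```

Since G is positive semidefinite, its nuclear norm is simply its trace, which is what the final assertion uses.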
By applying the discussion following Theorem 1 when G* is rank d, we can state an upper bound of (1/(2n^2)) ||Ĉ − C*||_F^2 ≤ O( (Lγ/C_f^2) sqrt((dn log(n) + log(1/δ))/|S|) ). However, it is still not clear that this is enough to recover D* or G*. Remarkably, despite this unknown component being in the kernel, we show next that it can be recovered.\nTheorem 4. Let D be a distance matrix of n points in R^d, let C be the component of D orthogonal to the kernel of Δ, and let λ_2(C) denote the second largest eigenvalue of C. If n > d + 2, then\n\nD = C + λ_2(C) J.   (10)\n\nThis shows that D is uniquely determined as a function of C. Therefore, since Δ(D) = Δ(C) and because C is orthogonal to the kernel of Δ, the distance matrix D can be recovered from Δ(D), even though the linear operator Δ is non-invertible.\nWe now provide a proof of Theorem 4 in the case where n > d + 3. The result is true in the case when n > d + 2 but requires a more detailed analysis, including the construction of a vector x such that Dx = 1 and 1^T x ≥ 0 for any distance matrix, a result in [17].\nProof. To prove Theorem 4 we need the following lemma, proved in the supplementary materials.\nLemma 4. Let D be a Euclidean distance matrix on n points. Then D is negative semidefinite on the subspace\n\n1^⊥ := {x ∈ R^n : 1^T x = 0}.\n\nFurthermore, ker(D) ⊂ 1^⊥.\nFor any matrix M, let λ_i(M) denote its ith largest eigenvalue. Under the conditions of the theorem, we show that for any α > 0, λ_2(D − αJ) = α. Since C = D − D̄ J, this proves the theorem.\nNote that λ_i(D − αJ) = λ_i(D − α11^T) + α for 1 ≤ i ≤ n and arbitrary α. So it suffices to show that λ_2(D − α11^T) = 0.\nBy Weyl's Theorem,\n\nλ_2(D − α11^T) ≤ λ_2(D) + λ_1(−α11^T).\n\nSince λ_1(−α11^T) = 0, we have λ_2(D − α11^T) ≤ λ_2(D). By the Courant-Fischer Theorem,\n\nλ_2(D) = min_{U : dim(U) = n−1} max_{x ∈ U, x ≠ 0} x^T D x / x^T x ≤ max_{x ∈ 1^⊥, x ≠ 0} x^T D x / x^T x ≤ 0,\n\nsince D is negative semidefinite on 1^⊥. 
Now let v_i denote the ith eigenvector of D with eigenvalue λ_i = 0. Then\n\n(D − α11^T) v_i = D v_i = 0,\n\nsince v_i^T 1 = 0 by Lemma 4. So D − α11^T has at least n − d − 2 zero eigenvalues, since rank(D) ≤ d + 2. In particular, if n > d + 3, then D − α11^T must have at least two eigenvalues equal to 0. Therefore, λ_2(D − α11^T) = 0.\nThe previous theorem along with Theorem 3 guarantees that we can recover G* as we increase the number of triplets sampled. The final theorem, which follows directly from Theorems 3 and 4, summarizes this.\nTheorem 5. Assume n > d + 2 and consider the setting of Theorems 1 and 2. As |S| → ∞, D̂ → D* where D̂ is the distance matrix corresponding to Ĝ (the solution to (8)).\nProof. Recall D̂ = Ĉ + λ_2(Ĉ) J, so as Ĉ → C*, D̂ → D*.\n\nFigure 1: G* generated with n = 64 points in d = 2 and d = 8 dimensions on the left and right.\n\n4 Experimental Study\nThis section empirically studies the properties of estimators suggested by our theory. It is not an attempt to perform an exhaustive empirical evaluation of different embedding techniques; for that see [18, 4, 6, 3]. In what follows each of the n points is generated randomly: x_i ~ N(0, (1/(2d)) I_d) ∈ R^d, i = 1, ..., n, motivated by the observation that\n\nE[|⟨L_t, G*⟩|] = E[ | ||x_i − x_j||^2 − ||x_i − x_k||^2 | ] ≤ E[ ||x_i − x_j||^2 ] = 2 E[ ||x_i||^2 ] = 1\n\nfor any triplet t = (i, j, k). We report the prediction error on a holdout set of 10,000 triplets and the error in Frobenius norm of the estimated Gram matrix over 36 random trials. We minimize the logistic MLE objective R̂_S(G) = (1/|S|) sum_{t ∈ S} log(1 + exp(−y_t⟨L_t, G⟩)).\nFor each algorithm considered, the domain of the objective variable G is the space of symmetric positive semi-definite matrices. 
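As a sanity check on this objective, the empirical risk can be computed directly from the observed triplets; a minimal sketch (NumPy; the function name is illustrative, not from the paper's code):

```python
import numpy as np

def empirical_risk(G, observations):
    # Empirical logistic risk: average logistic loss over observed triplets,
    # where <L_t, G> = G_jj - 2 G_ij - G_kk + 2 G_ik for t = (i, j, k).
    total = 0.0
    for (i, j, k), y in observations:
        lt = G[j, j] - 2 * G[i, j] - G[k, k] + 2 * G[i, k]
        total += np.log1p(np.exp(-y * lt))
    return total / len(observations)
```

At G = 0 every comparison is maximally uncertain and the risk equals log 2, a useful sanity check before optimizing.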
None of the methods impose the constraint max_{ij} |G_{ij}| ≤ γ (as done above), since this was used to simplify the analysis and does not have a large impact in practice.\nRank-d Projected Gradient Descent (PGD) performs gradient descent on the objective R̂_S(G) with line search, projecting onto the subspace spanned by the top d eigenvectors at each step (i.e., setting the smallest n − d eigenvalues to 0). Nuclear Norm PGD performs gradient descent on R̂_S(G), projecting onto the nuclear norm ball with radius ||G*||_*, where G* is the Gram matrix of the latent embedding. The nuclear norm projection can have the undesirable effect of shrinking the non-zero eigenvalues toward the origin. To compensate for this potential bias, we employ Nuclear Norm PGD Debiased, which takes the biased output of Nuclear Norm PGD, decomposes it into U E U^T where U ∈ R^{n×d} are the top d eigenvectors, and outputs U diag(ŝ) U^T where ŝ = argmin_{s ∈ R^d} R̂_S(U diag(s) U^T). This last algorithm is motivated by the observation that methods for minimizing ||·||_1 or ||·||_* are good at identifying the true support of a signal, but output biased magnitudes [19]. Rank-d PGD and Nuclear Norm PGD Debiased are novel ordinal embedding algorithms.\nFigure 1 presents how the algorithms behave for n = 64 and d = 2, 8. We observe that the debiased nuclear norm solution behaves near-identically to the rank-d solution and remark that this was observed in all of our experiments (see the supplementary materials for other values of n, d, and scalings of G*). 
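A minimal version of Rank-d PGD can be sketched as follows (NumPy; a fixed step size stands in for the line search used in the experiments, and all names are illustrative):

```python
import numpy as np

def rank_d_pgd(observations, n, d, steps=100, eta=0.5):
    # Sketch of Rank-d PGD: gradient descent on the logistic empirical
    # risk, then projection onto rank-d PSD matrices by zeroing all but
    # the top d eigenvalues at each step. eta is a fixed illustrative
    # step size (the experiments in the paper use a line search).
    G = np.zeros((n, n))
    for _ in range(steps):
        grad = np.zeros((n, n))
        for (i, j, k), y in observations:
            lt = G[j, j] - 2 * G[i, j] - G[k, k] + 2 * G[i, k]
            c = -y / (1.0 + np.exp(y * lt))  # derivative of the loss in lt
            # accumulate c * L_t using the symmetric pattern of L_t
            grad[j, j] += c
            grad[k, k] -= c
            grad[i, j] -= c
            grad[j, i] -= c
            grad[i, k] += c
            grad[k, i] += c
        G = G - (eta / len(observations)) * grad
        w, V = np.linalg.eigh(G)   # eigenvalues in ascending order
        w[:-d] = 0.0               # keep only the d largest eigenvalues
        w = np.maximum(w, 0.0)     # clip to remain positive semidefinite
        G = (V * w) @ V.T
    return G
```

The projection step is exactly the rank-d eigenvalue truncation described above; swapping it for a projection onto a nuclear norm ball would give a sketch of Nuclear Norm PGD instead.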
A popular technique for recovering rank d embeddings is to perform (stochastic) gradient descent on R̂_S(U^T U) with objective variable U ∈ R^{n×d} taken as the embedding [18, 4, 6]. In all of our experiments this method produced Gram matrices nearly identical to those produced by our Rank-d PGD method, but Rank-d PGD was an order of magnitude faster in our implementation. Also, in light of our isometry theorem, we can show that the Hessian of E[R̂_S(G)] is nearly a scaled identity, leading us to hypothesize that a globally optimal linear convergence result for this non-convex optimization may be possible using the techniques of [20, 21]. Finally, we note that previous literature has reported that nuclear norm optimizations like Nuclear Norm PGD tend to produce less accurate embeddings than those of non-convex methods [4, 6]. The results imply that Nuclear Norm PGD Debiased appears to close the performance gap between the convex and non-convex solutions.\n\nAcknowledgments This work was partially supported by NSF grants CCF-1218189 and IIS-1447449, NIH grant 1 U54 AI117924-01, AFOSR grant FA9550-13-1-0138, and ONR awards N00014-15-1-2620 and N00014-13-1-0129. We would also like to thank Amazon Web Services for providing the computational resources used for running our simulations.\n\nReferences\n[1] Roger N Shepard. The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2):125–140, 1962.\n[2] Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129, 1964.\n[3] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David J Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics, pages 11–18, 2007.\n[4] Omer Tamuz, Ce Liu, Ohad Shamir, Adam Kalai, and Serge J Belongie. 
Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 673–680, 2011.\n[5] Kevin G Jamieson and Robert D Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 1077–1084. IEEE, 2011.\n[6] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE, 2012.\n[7] Brian McFee and Gert Lanckriet. Learning multi-modal similarity. The Journal of Machine Learning Research, 12:491–523, 2011.\n[8] Matthäus Kleindessner and Ulrike von Luxburg. Uniqueness of ordinal embedding. In COLT, pages 40–67, 2014.\n[9] Yoshikazu Terada and Ulrike V Luxburg. Local ordinal embedding. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 847–855, 2014.\n[10] Ery Arias-Castro. Some theory for ordinal embedding. arXiv preprint arXiv:1501.02861, 2015.\n[11] Mark A Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference, 3(3), 2014.\n[12] Yu Lu and Sahand N Negahban. Individualized rank aggregation using nuclear norm regularization. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1473–1479. IEEE, 2015.\n[13] D. Park, J. Neeman, J. Zhang, S. Sanghavi, and I. Dhillon. Preference completion: Large-scale collaborative ranking from pairwise comparisons. Proc. Int. Conf. Machine Learning (ICML), 2015.\n[14] Jon Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing USA, 2011.\n[15] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. 
ESAIM: probability and statistics, 9:323–375, 2005.\n[16] Joel A. Tropp. An introduction to matrix concentration inequalities, 2015.\n[17] Pablo Tarazaga and Juan E. Gallardo. Euclidean distance matrices: new characterization and boundary properties. Linear and Multilinear Algebra, 57(7):651–658, 2009.\n[18] Kevin G Jamieson, Lalit Jain, Chris Fernandez, Nicholas J Glattard, and Rob Nowak. Next: A system for real-world development, evaluation, and application of active learning. In Advances in Neural Information Processing Systems, pages 2638–2646, 2015.\n[19] Nikhil Rao, Parikshit Shah, and Stephen Wright. Conditional gradient with enhancement and truncation for atomic norm regularization. In NIPS workshop on Greedy Algorithms, 2013.\n[20] Samet Oymak, Benjamin Recht, and Mahdi Soltanolkotabi. Sharp time–data tradeoffs for linear inverse problems. arXiv preprint arXiv:1507.04793, 2015.\n[21] Jie Shen and Ping Li. A tight bound of hard thresholding. arXiv preprint arXiv:1605.01656, 2016.\n", "award": [], "sourceid": 1382, "authors": [{"given_name": "Lalit", "family_name": "Jain", "institution": "University of Michigan"}, {"given_name": "Kevin", "family_name": "Jamieson", "institution": "UC Berkeley"}, {"given_name": "Rob", "family_name": "Nowak", "institution": "University of Wisconsin Madison"}]}