{"title": "Learning Nearest Neighbor Graphs from Noisy Distance Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 9586, "page_last": 9596, "abstract": "We consider the problem of learning the nearest neighbor graph of a dataset of n items. The metric is unknown, but we can query an oracle to obtain a noisy estimate of the distance between any pair of items. This framework applies to problem domains where one wants to learn people's preferences from responses commonly modeled as noisy distance judgments. In this paper, we propose an active algorithm to find the graph with high probability and analyze its query complexity. In contrast to existing work that forces Euclidean structure, our method is valid for general metrics, assuming only symmetry and the triangle inequality. Furthermore, we demonstrate efficiency of our method empirically and theoretically, needing only O(n\\log(n)\\Delta^{-2}) queries in favorable settings, where \\Delta^{-2} accounts for the effect of noise. Using crowd-sourced data collected for a subset of the UT~Zappos50K dataset, we apply our algorithm to learn which shoes people believe are most similar and show that it beats both an active baseline and ordinal embedding.", "full_text": "Learning Nearest Neighbor Graphs from Noisy\n\nDistance Samples\n\nBlake Mason \u21e4\n\nUniversity of Wisconsin\n\nMadison, WI 53706\nbmason3@wisc.edu\n\nArdhendu Tripathy \u21e4\nUniversity of Wisconsin\n\nMadison, WI 53706\n\nastripathy@wisc.edu\n\nRobert Nowak\n\nUniversity of Wisconsin\n\nMadison, WI 53706\nrdnowak@wisc.edu\n\nAbstract\n\nWe consider the problem of learning the nearest neighbor graph of a dataset of n\nitems. The metric is unknown, but we can query an oracle to obtain a noisy estimate\nof the distance between any pair of items. This framework applies to problem\ndomains where one wants to learn people\u2019s preferences from responses commonly\nmodeled as noisy distance judgments. 
In this paper, we propose an active algorithm\nto find the graph with high probability and analyze its query complexity. In contrast\nto existing work that forces Euclidean structure, our method is valid for general\nmetrics, assuming only symmetry and the triangle inequality. Furthermore, we\ndemonstrate efficiency of our method empirically and theoretically, needing only\nO(n\\log(n)\\Delta^{-2}) queries in favorable settings, where \\Delta^{-2} accounts for the effect\nof noise. Using crowd-sourced data collected for a subset of the UT Zappos50K\ndataset, we apply our algorithm to learn which shoes people believe are most\nsimilar and show that it beats both an active baseline and ordinal embedding.\n\n1 Introduction\n\nIn modern machine learning applications, we frequently seek to learn proximity/similarity relationships\nbetween a set of items given only noisy access to pairwise distances. For instance, practitioners\nwishing to estimate internet topology frequently collect one-way-delay measurements to estimate the\ndistance between a pair of hosts [9]. Such measurements are affected by physical constraints as well as\nserver load, and are often noisy. Researchers studying movement in hospitals from WiFi localization\ndata likewise contend with noisy distance measurements due to both temporal variability and varying\nsignal strengths inside the building [4]. Additionally, human judgments are commonly modeled as\nnoisy distances [26, 23]. As an example, Amazon Discover asks customers their preferences about\ndifferent products and uses this information to recommend new items it believes are similar based\non this feedback. We are often primarily interested in the closest or most similar item to a given\none, e.g., the closest server, the closest doctor, the most similar product. The particular item of\ninterest may not be known a priori. 
Internet traffic can fluctuate, different patients may suddenly need\nattention, and customers may be looking for different products. To handle this, we must learn the\nclosest/most similar item for each item. This paper introduces the problem of learning the Nearest\nNeighbor Graph that connects each item to its nearest neighbor from noisy distance measurements.\nProblem Statement: Consider a set of n points X = \\{x_1, \\cdots, x_n\\} in a metric space. The metric\nis unknown, but we can query a stochastic oracle for an estimate of any pairwise distance. In as few\nqueries as possible, we seek to learn a nearest neighbor graph of X that is correct with probability\n1 - \\delta, where each x_i is a vertex and has a directed edge to its nearest neighbor x_{i^*} \\in X \\setminus \\{x_i\\}.\n*Authors contributed equally to this paper and are listed alphabetically.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Related work\nNearest neighbor problems (from noiseless measurements) are well studied and we direct the reader to\n[3] for a survey. [6, 30, 25] all provide theory and algorithms to learn the nearest neighbor graph which\napply in the noiseless regime. Note that the problem in the noiseless setting is very different. If noise\ncorrupts measurements, the methods from the noiseless setting can suffer persistent errors. There has\nbeen recent interest in introducing noise via subsampling for a variety of distance problems [24, 1, 2],\nthough the noise here is not actually part of the data but introduced for efficiency. In our algorithm,\nwe use the triangle inequality to get tighter estimates of noisy distances in a process equivalent to the\nclassical Floyd\u2013Warshall algorithm [11, 7]. This has strong connections to the metric repair literature [5, 13]\nwhere one seeks to alter a set of noisy distance measurements as little as possible to learn a metric\nsatisfying the standard axioms. 
[27] similarly uses the triangle inequality to bound unknown distances\nin a related but noiseless setting. In the specific case of noisy distances corresponding to human\njudgments, a number of algorithms have been proposed to handle related problems, most notably\nEuclidean embedding techniques, e.g. [17, 31, 23]. To reduce the load on human subjects, several\nattempts at an active method for learning Euclidean embeddings have been made but have only seen\nlimited success [20]. Among the culprits is the strict and often unrealistic modeling assumption that\nthe metric be Euclidean and low dimensional. In the particular case that the algorithm may query\ntriplets (e.g., \u201cis i or j closer to k?\u201d) and receive noisy responses, [22] develop an interesting, passive\ntechnique under general metrics for learning a relative neighborhood graph which is an undirected\nrelaxation of a nearest neighbor graph.\n\n1.2 Main contributions\nIn this paper we introduce the problem of identifying the nearest neighbor graph from noisy distance\nsamples and propose ANNTri, an active algorithm, to solve it for general metrics. We empirically and\ntheoretically analyze its complexity to show improved performance over a passive and an active baseline.\nIn favorable settings, such as when the data forms clusters, ANNTri needs only O(n\\log(n)/\\Delta^2)\nqueries, where \\Delta^{-2} accounts for the effect of noise. Furthermore, we show that ANNTri achieves\nsuperior performance compared to methods which require much stronger assumptions. We highlight\ntwo such examples. In Fig. 2c, for an embedding in \\mathbb{R}^2, ANNTri outperforms the common technique\nof triangulation that works by estimating each point\u2019s distance to a set of anchors. In Fig. 3b, we\nshow that ANNTri likewise outperforms Euclidean embedding for predicting which images are most\nsimilar from a set of similarity judgments collected on Amazon Mechanical Turk. The rest of the\npaper is organized as follows. 
In Section 2, we further set up the problem. In Sections 3 and 4 we\npresent the algorithm and analyze its theoretical properties. In Section 5 we show ANNTri\u2019s empirical\nperformance on both simulated and real data. In particular, we highlight its efficiency in learning\nfrom human judgments.\n\n2 Problem setup and summary of our approach\nWe denote distances as d_{i,j} where d : X \\times X \\to \\mathbb{R}_{\\geq 0} is a distance function satisfying the standard\naxioms and define x_{i^*} := \\arg\\min_{x \\in X \\setminus \\{x_i\\}} d(x_i, x). Though the distances are unknown, we are able\nto draw independent samples of their true values according to a stochastic distance oracle, i.e. querying\n\nQ(i, j) yields a realization of d_{i,j} + \\eta, (1)\n\nwhere \\eta is a zero-mean subGaussian random variable assumed to have scale parameter \\sigma = 1. We let\n\\hat{d}_{i,j}(t) denote the empirical mean of the values returned by Q(i, j) queries made until time t. The\nnumber of Q(i, j) queries made until time t is denoted as T_{i,j}(t). A possible approach to obtain the\nnearest neighbor graph is to repeatedly query all \\binom{n}{2} pairs and report x_{i^*}(t) = \\arg\\min_{j \\neq i} \\hat{d}_{i,j}(t) for\nall i \\in [n]. But since we only wish to learn x_{i^*} \\forall i, if d_{i,k} \\gg d_{i,i^*}, we do not need to query Q(i, k)\nas many times as Q(i, i^*). To improve our query efficiency, we could instead adaptively sample to\nfocus queries on distances that we estimate are smaller. A simple adaptive method to find the nearest\nneighbor graph would be to iterate over x_1, x_2, \\ldots, x_n and use a best-arm identification algorithm\nto find x_{i^*} in the ith round.^1 However, this procedure treats each round independently, ignoring\nproperties of metric spaces that allow information to be shared between rounds.\n\n
^1 We could also proceed in a non-iterative manner, by adaptively choosing which among the \\binom{n}{2} pairs to query\nnext. However, this has worse empirical performance and the same theoretical guarantees as the in-order approach.\n\n\u2022 Due to symmetry, for any i < j the queries Q(i, j) and Q(j, i) follow the same law, and we\ncan reuse values of Q(i, j) collected in the ith round while finding x_{j^*} in the jth round.\n\n\u2022 Using concentration bounds on d_{i,j} and d_{i,k} from samples from Q(i, j) and Q(i, k) collected\nin the ith round, we can bound d_{j,k} via the triangle inequality. As a result, we may be able\nto state x_k \\neq x_{j^*} without even querying Q(j, k).\n\nOur proposed algorithm ANNTri uses all the above ideas to find the nearest neighbor graph of X. For\ngeneral X, the sample complexity of ANNTri contains a problem-dependent term that involves the\norder in which the nearest neighbors are found. For an X consisting of sufficiently well separated\nclusters, this order-dependence for the sample complexity does not exist.\n\n3 Algorithm\n\nOur proposed algorithm (Algorithm 1) ANNTri finds the nearest neighbor graph of X with probability\n1 - \\delta. It iterates over x_j \\in X in order of their subscript index and finds x_{j^*} in the jth \u2018round\u2019. All\nbounds, counts of samples, and empirical means are stored in n \\times n symmetric matrices in order\nto share information between different rounds. We use Python array/Matlab notation to indicate\nindividual entries in the matrices, e.g., \\hat{d}[i, j] = \\hat{d}_{i,j}(t). The number of Q(i, j) queries made\nis stored in the (i, j)th entry of T. Matrices U and L record upper and lower confidence\nbounds on d_{i,j}. U^\\triangle and L^\\triangle record the associated triangle inequality bounds. Symmetry is ensured\nby updating the (j, i)th entry at the same time as the (i, j)th entry for each of the above matrices. 
In\nthe jth round, ANNTri finds the correct x_{j^*} with probability 1 - \\delta/n by calling SETri (Algorithm 2),\na modification of the successive elimination algorithm for best-arm identification. In contrast to\nstandard successive elimination, at each time step SETri only samples those points in the active set\nthat have the fewest number of samples.\n\nAlgorithm 1 ANNTri\nRequire: n, procedure SETri (Alg. 2), confidence \\delta\n1: Initialize \\hat{d}, T as n \\times n matrices of zeros, U, U^\\triangle as n \\times n matrices where each entry is \\infty, L, L^\\triangle\n   as n \\times n matrices where each entry is -\\infty, NN as a length n array\n2: for j = 1 to n do\n3:   for i = 1 to n do {find tightest triangle bounds}\n4:     for all k \\neq i do\n5:       Set U^\\triangle[i, k], U^\\triangle[k, i] \\leftarrow \\min_\\ell U^{\\triangle \\ell}_{i,k}, see (7)\n6:       Set L^\\triangle[i, k], L^\\triangle[k, i] \\leftarrow \\max_\\ell L^{\\triangle \\ell}_{i,k}, see (8)\n7:   NN[j] = SETri(j, \\hat{d}, U, U^\\triangle, L, L^\\triangle, T, \\xi = \\delta/n)\n8: return The nearest neighbor graph adjacency list NN\n\nAlgorithm 2 SETri\nRequire: index j, callable oracle Q(\u00b7,\u00b7) (Eq. 
(1)), six n \\times n matrices: \\hat{d}, U, U^\\triangle, L, L^\\triangle, T, confidence \\xi\n1: Initialize active set A_j \\leftarrow \\{a \\neq j : \\max\\{L[a, j], L^\\triangle[a, j]\\} < \\min_k \\min\\{U[j, k], U^\\triangle[j, k]\\}\\}\n2: while |A_j| > 1 do\n3:   for all i \\in A_j such that T[i, j] = \\min_{k \\in A_j} T[i, k] do {only query points with fewest samples}\n4:     Update \\hat{d}[i, j], \\hat{d}[j, i] \\leftarrow (\\hat{d}[i, j] \\cdot T[i, j] + Q(i, j))/(T[i, j] + 1)\n5:     Update T[i, j], T[j, i] \\leftarrow T[i, j] + 1\n6:     Update U[i, j], U[j, i] \\leftarrow \\hat{d}[i, j] + C_\\xi(T[i, j])\n7:     Update L[i, j], L[j, i] \\leftarrow \\hat{d}[i, j] - C_\\xi(T[i, j])\n8:   Update A_j \\leftarrow \\{a \\neq j : \\max\\{L[a, j], L^\\triangle[a, j]\\} < \\min_k \\min\\{U[j, k], U^\\triangle[j, k]\\}\\}\n9: return The index i for which x_i \\in A_j\n\n3.1 Confidence bounds on the distances\nUsing the subGaussian assumption on the noise random process, we can use Hoeffding\u2019s inequality\nand a union bound over time to get the following confidence intervals on the distance d_{j,k}:\n\n|\\hat{d}_{j,k}(t) - d_{j,k}| \\leq \\sqrt{2 \\frac{\\log(4n^2 (T_{j,k}(t))^2/\\delta)}{T_{j,k}(t)}} =: C_{\\delta/n}(T_{j,k}(t)), (2)\n\nwhich hold for all points x_k \\in X \\setminus \\{x_j\\} at all times t with probability 1 - \\delta/n, i.e.\n\nP(\\forall t \\in \\mathbb{N}, \\forall i \\neq j, d_{i,j} \\in [L_{i,j}(t), U_{i,j}(t)]) \\geq 1 - \\delta/n, (3)\n\nwhere L_{i,j}(t) := \\hat{d}_{i,j}(t) - C_{\\delta/n}(T_{i,j}(t)) and U_{i,j}(t) := \\hat{d}_{i,j}(t) + C_{\\delta/n}(T_{i,j}(t)). [10] use the above\nprocedure to derive the following upper bound for the number of oracle queries used to find x_{j^*}:\n\nO\\left(\\sum_{k \\neq j} \\frac{\\log(n^2/(\\delta \\Delta_{j,k}))}{\\Delta_{j,k}^2}\\right), (4)\n\nwhere for any x_k \\notin \\{x_j, x_{j^*}\\} the suboptimality gap \\Delta_{j,k} := d_{j,k} - d_{j,j^*} characterizes how hard it\nis to rule out x_k from being the nearest neighbor. We also set \\Delta_{j,j^*} := \\min_{k \\notin \\{j, j^*\\}} \\Delta_{j,k}. Note that\none can use tighter confidence bounds as detailed in [12] and [18] to obtain sharper bounds on the\nsample complexity of this subroutine.\n\n3.2 Computing the triangle bounds and active set A_j(t)\nSince A_j(\u00b7) is only computed within SETri, we abuse notation and use its argument t to indicate\nthe time counter private to SETri. 
Thus, the initial active set computed by SETri when called in\nthe jth round is denoted A_j(0). During the jth round, the active set A_j(t) contains all points that\nhave not been eliminated from being the nearest neighbor of x_j at time t. In what follows, we add a\nsuperscript \\triangle to denote a bound obtained via the triangle inequality, whose precise definitions are\ngiven in Lemma 3.1. We define x_j\u2019s active set at time t as\n\nA_j(t) := \\{a \\neq j : \\max\\{L_{a,j}(t), L^\\triangle_{a,j}(t)\\} < \\min_k \\min\\{U_{j,k}(t), U^\\triangle_{j,k}(t)\\}\\}. (5)\n\nAssuming L^\\triangle_{a,j}(t) and U^\\triangle_{j,k}(t) are valid lower and upper bounds on d_{a,j}, d_{j,k} respectively, (5) states\nthat point x_a is active if its lower bound is less than the minimum upper bound for d_{j,k} over all\nchoices of x_k \\neq x_j. Next, for any (j, k) we construct triangle bounds L^\\triangle, U^\\triangle on the distance d_{j,k}.\nIntuitively, for some reals g, g', h, h', if d_{i,j} \\in [g, g'] and d_{i,k} \\in [h, h'] then d_{j,k} \\leq g' + h', and\n\nd_{j,k} \\geq |d_{i,j} - d_{i,k}| = \\max\\{d_{i,j}, d_{i,k}\\} - \\min\\{d_{i,j}, d_{i,k}\\} \\geq (\\max\\{g, h\\} - \\min\\{g', h'\\})_+, (6)\n\nwhere (s)_+ := \\max\\{s, 0\\}. The lower bound can be seen as true by Fig. 7 in the Appendix. Lemma 3.1\nuses these ideas to form upper and lower bounds on distances by the triangle inequality. Note that\nthis definition is inherently recursive as it may rely on past triangle inequality bounds to achieve the\ntightest possible result. We denote triangle inequality upper and lower bounds on d_{j,k} due to a point\ni at time t as U^{\\triangle i}_{j,k} and L^{\\triangle i}_{j,k} respectively.\nLemma 3.1. For all k \\neq 1, U^{\\triangle 1}_{1,k}(t) = U^\\triangle_{1,k}(t) = U_{1,k}(t). For any i < j define\n\nU^{\\triangle i}_{j,k}(t) := \\min_{\\max\\{i_1, i_2\\} < i} (\\min\\{U_{i,j}(t), U^{\\triangle i_1}_{i,j}(t)\\} + \\min\\{U_{i,k}(t), U^{\\triangle i_2}_{i,k}(t)\\}). (7)\n\n*j\n\n1[Aj,k]Hj,k +Xk j. In general H_{j,k}\ncalls are necessary, unless a triangle inequality bound allows for elimination of k without sampling,\nas given by \\mathbb{1}[A_{j,k}]. The second term bounds the number of calls to Q(j, k) for all k < j. 
It has\nthe same form as the first term, except we must now use past samples we may already have via\nsymmetry of distances (provided the triangle inequality did not prevent us from querying Q(k, j) in\nthe previous round). The (\\cdot)_+ operation prevents negative terms, since it may be the case that no\nadditional samples are necessary, even if we don\u2019t use the triangle inequality for elimination.\nIn Theorem B.6, in the Appendix, we state the sample complexity when triangle inequality bounds\nare ignored by ANNTri, and this upper bounds (11). Whether a point can be eliminated by the triangle\ninequality depends both on the underlying distances and the order in which ANNTri finds each\nnearest neighbor (cf. Lemma 4.3). In general, this dependence on the order is necessary to ensure\nthat past samples exist and may be used to form upper and lower bounds. Furthermore, it is worth\nnoting that even without noise the triangle inequality may not always help. A simple example is any\narrangement of points such that 0 < r \\leq d_{j,k} < 2r \\forall j, k. To see this, consider triangle bounds on\nany distance d_{j,k} due to any x_i, x_{i'} \\in X \\setminus \\{x_j, x_k\\}. Then |d_{i,j} - d_{i,k}| \\leq r < 2r \\leq d_{i',j} + d_{i',k} \\forall i, i'\nso L^\\triangle_{i,j} < U^\\triangle_{j,k} \\forall i, j, k. Thus no triangle upper bounds separate from triangle lower bounds so no\nelimination via the triangle inequality occurs. In such cases, it is necessary to sample all O(n^2)\ndistances. However, in more favorable settings where data may be split into clusters, the sample\ncomplexity can be much lower by using the triangle inequality.\nThe order in which \\{x_{i^*}\\} are found follows their subscript index, which is randomly chosen and\nfixed before starting the algorithm. As described above, different orders in which \\{x_i\\} are processed\ncan affect the query complexity of our algorithm. The best order that minimizes the total number of\nqueries made in general depends on the true distance values. 
Even if the oracle is noiseless, there are\ndatasets where the pair (i, j) with the smallest d_{i,j} must be queried within the first n queries in order\nto identify the NN-graph using the minimum number of queries. Since this requirement cannot be\nensured by any algorithm that only has access to information via a distance oracle, it is not possible\nto achieve the minimum number of queries in such examples.\n\n4.3 Adaptive gains via the triangle inequality\nWe highlight two settings where ANNTri provably achieves sample complexity better than O(n^2)\nindependent of the order of the rounds. Consider a dataset containing c clusters of n/c points each as\nin Fig. 1a. Denote the mth cluster as C_m and suppose the distances between the points are such that\n\n\\{x_k : d_{i,k} < 6C_{\\delta/n}(1) + 2d_{i,j}\\} \\subseteq C_m \\forall i, j \\in C_m. (12)\n\nThe above condition is ensured if the distance between any two points belonging to different clusters\nis at least a (\\delta, n)-dependent constant plus twice the diameter of any cluster.\nTheorem 4.5. Consider a dataset of \\sqrt{n} clusters which satisfy the condition in (12). Then ANNEasy\nlearns the correct nearest neighbor graph of X with probability at least 1 - \\delta in\n\nO(n^{3/2} \\Delta^{-2}) (13)\n\nqueries where \\Delta^{-2} := \\frac{1}{n^{3/2}} \\sum_{i=1}^{\\sqrt{n}} \\sum_{j,k \\in C_i} \\log(n^2/(\\delta \\Delta_{j,k})) \\Delta_{j,k}^{-2} is the average number of samples for\ndistances between points in the same cluster.\nBy contrast, random sampling requires O(n^2 \\Delta_{\\min}^{-2}) where \\Delta_{\\min}^{-2} := \\min_{j,k} \\log(n^2/(\\delta \\Delta_{j,k})) \\Delta_{j,k}^{-2} \\geq\n\\Delta^{-2}. In fact, the value in (11) can be even lower if unions of clusters also satisfy (12). In this case,\nthe triangle inequality can be used to separate groups of clusters. For example, in Fig. 1b, if C_1 \\cup C_2\nand C_3 \\cup C_4 satisfy (12) along with C_1, \\cdots, C_4, then the triangle inequality can separate C_1 \\cup C_2\nand C_3 \\cup C_4. This process can be generalized to consider a dataset that can be split recursively into\nsubclusters following a binary tree of k levels. 
At each level, the clusters are assumed to satisfy (12).\nWe refer to such a dataset as hierarchical in (12).\nTheorem 4.6. Consider a dataset X = \\bigcup_{i=1}^{n/\\nu} C_i of n/\\nu clusters of size \\nu = O(\\log(n)) that is\nhierarchical in (12). Then ANNEasy learns the correct nearest neighbor graph of X with probability\nat least 1 - \\delta in\n\nO(n \\log(n) \\Delta^{-2}) (14)\n\nqueries where \\Delta^{-2} := \\frac{1}{n\\nu} \\sum_{i=1}^{n/\\nu} \\sum_{j,k \\in C_i} \\log(n^2/(\\delta \\Delta_{j,k})) \\Delta_{j,k}^{-2} is the average number of samples for\ndistances between points in the same cluster.\nExpression (14) matches known lower bounds of O(n log(n)) on the sample complexity for learning\nthe nearest neighbor graph from noiseless samples [30]; the additional penalty of \\Delta^{-2} is due to the\neffect of noise in the samples. An easy way to see the lower bound is to consider the fact that there are\nO(n^{n-1}) unique nearest neighbor graphs so any algorithm will require O(\\log(n^{n-1})) = O(n log(n))\nbits of information to identify the correct one. In Appendix C, we state the sample complexity in the\naverage case, as opposed to the high probability statements above. The analog of the cluster condition\n(12) there does not involve constants and is solely in terms of pairwise distances (cf. (33)).\n\n5 Experiments\n\nHere we evaluate the performance of ANNTri on simulated and real data. To construct the tightest\npossible confidence bounds for SETri, we use the law of the iterated logarithm as in [18] with\nparameters \\epsilon = 0.7 and \\delta = 0.1. Our analysis bounds the number of queries made to the oracle.\nWe visualize the performance by tracking the empirical error rate with the number of queries made\nper point. For a given point x_i, we say that a method makes an error at the tth sample if it fails\nto return x_{i^*} as the nearest neighbor, that is, x_{i^*} \\neq \\arg\\min_j \\hat{d}[i, j]. Throughout, we will compare\nANNTri against random sampling. 
Additionally, to highlight the effect of the triangle inequality, we\nwill compare our method against the same active procedure, but ignoring triangle inequality bounds\n(referred to as ANN in plots). All baselines may reuse samples via symmetry as well. We plot all\ncurves with 95% confidence regions shaded.\n\n5.1 Simulated Experiments\n\nTo test the effectiveness of our method, we generate an embedding of 10 clusters of 10 points spread\naround a circle such that each cluster is separated by at least 10% of its diameter in \\mathbb{R}^2 as shown\nin Fig. 2a. We consider Gaussian noise with \\sigma = 0.1. In Fig. 2b, we present average error rates of\nANNTri, ANN, and Random plotted on a log scale. ANNTri quickly learns x_{i^*} and has lower error with\n0 samples due to initial elimination by the triangle inequality. The error curves are averaged over\n4000 repetitions. All rounds were capped at 10^5 samples for efficiency.\n\nFigure 2: (a) Example embedding. (b) Error curves. (c) Comparison to triangulation. Comparison of ANNTri to ANN and Random for 10 clusters of 10 points separated by 10%\nof their diameter with \\sigma = 0.1. ANNTri identifies clusters of nearby points more easily.\n\n5.1.1 Comparison to triangulation\nAn alternative way for a practitioner to obtain the nearest neighbor graph might be to estimate\ndistances with respect to a few anchor points and then triangulate to learn the rest. [9] provide a\ncomprehensive example, and we summarize in Appendix A.2 for completeness. The triangulation\nmethod is na\u00efve for two reasons. First, it requires much stronger modeling assumptions than ANNTri,\nnamely that the metric is Euclidean and the points are in a low-dimensional space of known dimension.\nForcing Euclidean structure can lead to unpredictable errors if the underlying metric might not be\nEuclidean, such as in data from human judgments. Second, this procedure may be more noise\nsensitive because it estimates squared distances. 
In the example in Section A.2, this leads to the\nadditive noise being sub-exponential rather than subGaussian. In Fig. 2c, we show that even in a\nfavorable setting where distances are truly sampled from a low-dimensional Euclidean embedding and\npairwise distances between anchors are known exactly, triangulation still performs poorly compared\nto ANNTri. We consider the same 2-dimensional embedding of points as in Fig. 2a for a noise\nvariance of \\sigma = 1 and compare ANNTri and triangulation for different numbers of samples.\n\n5.2 Human judgment experiments\n5.2.1 Setup\nHere we consider the problem of learning from human judgments. For this experiment, we used a\nset X of 85 images of shoes drawn from the UT Zappos50k dataset [32, 33] and seek to learn which\nshoes are most visually similar. To do this, we consider queries of the form \u201cbetween i, j, and k,\nwhich two are most similar?\u201d. We show example queries in Figs. 5a and 5b in the Appendix. Each\nquery maps to a pair of triplet judgments of the form \u201cis j or k more similar to i?\u201d. For instance, if\ni and j are chosen, then we may infer the judgments \u201ci is more similar to j than to k\u201d and \u201cj is\nmore similar to i than to k\u201d. We therefore construct these queries from a set of triplets collected from\nparticipants on Mechanical Turk by [15]. The set contains multiple samples of all 85\\binom{84}{2} unique\ntriples so that the probability of any triplet response can be estimated. We expect that i^* is most\ncommonly selected as being more similar to i than any third point k. We take distance to correspond\nto the fraction of times that two images i, j are judged as being more similar to each other than a\ndifferent pair in a triplet query (i, j, k). Let E^j_{i,k} be the event that the pair i, k are chosen as most\nsimilar amongst i, j, and k. 
Accordingly, we define the \u2018distance\u2019 between images i and j as\n\nd_{i,j} := E_{k \\sim \\mathrm{Unif}(X \\setminus \\{i,j\\})} E[\\mathbb{1}_{E^j_{i,k}} | k]\n\nwhere k is drawn uniformly from the remaining 83 images in X \\setminus \\{i, j\\}. For a fixed value of k,\n\nE[\\mathbb{1}_{E^j_{i,k}} | k] = P(E^j_{i,k} | k) = P(\u201ci more similar to j than to k\u201d) P(\u201cj more similar to i than to k\u201d),\n\nwhere the probabilities are the empirical probabilities of the associated triplets in the dataset. This\ndistance is a quasi-metric on our dataset as it does not always satisfy the triangle inequality, but\nsatisfies it with a multiplicative constant: d_{i,j} \\leq 1.47(d_{i,k} + d_{j,k}) \\forall i, j, k. Relaxing metrics to\nquasi-metrics has a rich history in the classical nearest neighbors literature [16, 29, 14], and ANNTri\ncan be trivially modified to handle quasi-metrics. However, we empirically note that < 1% of the\ndistances violate the ordinary triangle inequality here so we ignore this point in our evaluation.\n\nFigure 3: (a) Sample complexity gains. (b) Comparison to STE. Performance of ANNTri on the Zappos dataset. ANNTri achieves superior performance\nover STE in identifying nearest neighbors and has 5-10x gains in sample efficiency over random.\n5.2.2 Results\nWhen ANNTri or any baseline queries Q(i, j) from the oracle, we randomly sample a third point\nk \\in X \\setminus \\{i, j\\} and flip a coin with probability P(E^j_{i,k}). The resulting sample is an unbiased estimate\nof the distance between i and j. In Fig. 3a, we compare the error rate of ANNTri, averaged over 1000 trials,\nto Random and STE. We also plot associated gains in sample complexity by\nANNTri. In particular, we see gains of 5-10x over random sampling, and gains up to 16x relative\nto ordinal embedding. ANNTri also shows 2x gains over ANN in sample complexity (see Fig. 6 in\nAppendix).\nAdditionally, a standard way of learning from triplet data is to perform ordinal embedding. 
With a\nlearned embedding, the nearest neighbor graph may easily be computed. In Fig. 3b, we compare\nANNTri against the state-of-the-art STE algorithm [31] for estimating Euclidean embeddings from\ntriplets, and select the embedding dimension of d = 16 via cross validation. To normalize the number\nof samples, we first perform ANNTri with a given max budget of samples and record the total number\nneeded. Then we select a random set of triplets of the same size and learn an embedding in \\mathbb{R}^{16} via\nSTE. We compare both methods on the fraction of nearest neighbors predicted correctly. On the x\naxis, we show the total number of triplets given to each method. For small dataset sizes, there is\nlittle difference; however, for larger dataset sizes, ANNTri significantly outperforms STE. Given that\nANNTri is active, it is reasonable to wonder if STE would perform better with an actively sampled\ndataset, such as [28]. Many of these methods are computationally intensive and lack empirical\nsupport [20], but we can embed using the full set of triplets to mitigate the effect of the subsampling\nprocedure. Doing so, STE achieves 52% error, within the confidence bounds of the largest subsample\nshown in Fig. 3b. In particular, more data and more carefully selected datasets may not correct for\nthe bias induced by forcing Euclidean structure.\n\n6 Conclusion\n\nIn this paper we solve the nearest neighbor graph problem by adaptively querying distances. Our\nmethod makes no assumptions beyond standard metric properties and is empirically shown to achieve\nsample complexity gains over passive sampling. In the case of clustered data, we show provable\ngains and achieve optimal rates in favorable settings. One interesting avenue for future work would\nbe to specialize to the case of hyperbolic embeddings which naturally encode trees [8] and may be a\nmore flexible way to describe hierarchical clusters as in Theorem 4.6. 
Implementations of ANNTri,\nANN, and RANDOM can be found alongside a demo and summary slides at https://github.com/blakemas/nngraph.\n\nAcknowledgments\n\nThe authors wish to thank Lalit Jain for many helpful discussions over the course of this work for\nwhich the paper is better and the reviewers for their helpful suggestions. This work was partially\nsupported by AFOSR/AFRL grants FA8750-17-2-0262 and FA9550-18-1-0166.\n\nReferences\n[1] Vivek Bagaria, Govinda M Kamath, Vasilis Ntranos, Martin J Zhang, and David Tse. Medoids\nin almost linear time via multi-armed bandits. arXiv preprint arXiv:1711.00817, 2017.\n\n[2] Vivek Bagaria, Govinda M Kamath, and David N Tse. Adaptive monte-carlo optimization.\narXiv preprint arXiv:1805.08321, 2018.\n\n[3] Nitin Bhatia et al. Survey of nearest neighbor techniques. arXiv preprint arXiv:1007.0085,\n2010.\n\n[4] Brandon M Booth, Tiantian Feng, Abhishek Jangalwa, and Shrikanth S Narayanan. Toward\nrobust interpretable human movement pattern analysis in a workplace setting. In ICASSP 2019-2019\nIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),\npages 7630\u20137634. IEEE, 2019.\n\n[5] Justin Brickell, Inderjit S Dhillon, Suvrit Sra, and Joel A Tropp. The metric nearness problem.\nSIAM Journal on Matrix Analysis and Applications, 30(1):375\u2013396, 2008.\n\n[6] Kenneth L Clarkson. Fast algorithms for the all nearest neighbors problem. In 24th Annual\nSymposium on Foundations of Computer Science (sfcs 1983), pages 226\u2013232. IEEE, 1983.\n\n[7] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to\nalgorithms. MIT press, 2009.\n\n[8] Andrej Cvetkovski and Mark Crovella. Hyperbolic embedding and routing for dynamic graphs.\nIn IEEE INFOCOM 2009, pages 1647\u20131655. IEEE, 2009.\n\n[9] Brian Eriksson, Paul Barford, Joel Sommers, and Robert Nowak. A learning-based approach\nfor ip geolocation. 
In International Conference on Passive and Active Network Measurement,\npages 171\u2013180. Springer, 2010.\n\n[10] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions\nfor the multi-armed bandit and reinforcement learning problems. Journal of machine learning\nresearch, 7(Jun):1079\u20131105, 2006.\n\n[11] Robert W Floyd. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.\n\n[12] Aur\u00e9lien Garivier. Informational con\ufb01dence bounds for self-normalized averages and applica-\n\ntions. In 2013 IEEE Information Theory Workshop (ITW), pages 1\u20135. IEEE, 2013.\n\n[13] Anna C Gilbert and Lalit Jain. If it ain\u2019t broke, don\u2019t \ufb01x it: Sparse metric repair. In 2017 55th\nAnnual Allerton Conference on Communication, Control, and Computing (Allerton), pages\n612\u2013619. IEEE, 2017.\n\n[14] Navin Goyal, Yury Lifshits, and Hinrich Sch\u00fctze. Disorder inequality: a combinatorial approach\nto nearest neighbor search. In Proceedings of the 2008 International Conference on Web Search\nand Data Mining, pages 25\u201332. ACM, 2008.\n\n[15] Eric Heim, Matthew Berger, Lee Seversky, and Milos Hauskrecht. Active perceptual similarity\n\nmodeling with auxiliary information. arXiv preprint arXiv:1511.02254, 2015.\n\n[16] Michael E Houle and Michael Nett. Rank-based similarity search: Reducing the dimensional\ndependence. IEEE transactions on pattern analysis and machine intelligence, 37(1):136\u2013150,\n2015.\n\n[17] Lalit Jain, Kevin G Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for\nordinal embedding. In Advances In Neural Information Processing Systems, pages 2711\u20132719,\n2016.\n\n[18] K. Jamieson and R. Nowak. Best-arm identi\ufb01cation algorithms for multi-armed bandits in the\n\ufb01xed con\ufb01dence setting. In 2014 48th Annual Conference on Information Sciences and Systems\n(CISS), pages 1\u20136, March 2014. 
doi: 10.1109/CISS.2014.6814096.\n\n[19] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. On finding the largest mean among many. arXiv preprint arXiv:1306.3917, 2013.\n\n[20] Kevin G Jamieson, Lalit Jain, Chris Fernandez, Nicholas J Glattard, and Rob Nowak. Next: A system for real-world development, evaluation, and application of active learning. In Advances in Neural Information Processing Systems, pages 2656\u20132664, 2015.\n\n[21] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228\u2013251, 2013.\n\n[22] Matth\u00e4us Kleindessner and Ulrike Von Luxburg. Lens depth function and k-relative neighborhood graph: versatile tools for ordinal data analysis. The Journal of Machine Learning Research, 18(1):1889\u20131940, 2017.\n\n[23] Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115\u2013129, 1964.\n\n[24] Daniel LeJeune, Richard G Baraniuk, and Reinhard Heckel. Adaptive estimation for approximate k-nearest-neighbor computations. arXiv preprint arXiv:1902.09465, 2019.\n\n[25] Jagan Sankaranarayanan, Hanan Samet, and Amitabh Varshney. A fast all nearest neighbor algorithm for applications involving large point-clouds. Computers & Graphics, 31(2):157\u2013174, 2007.\n\n[26] Roger N Shepard. The analysis of proximities: multidimensional scaling with an unknown distance function. i. Psychometrika, 27(2):125\u2013140, 1962.\n\n[27] Adish Singla, Sebastian Tschiatschek, and Andreas Krause. Actively learning hemimetrics with applications to eliciting user preferences. In International Conference on Machine Learning, pages 412\u2013420, 2016.\n\n[28] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033, 2011.\n\n[29] Dominique Tschopp, Suhas Diggavi, Payam Delgosha, and Soheil Mohajer. 
Randomized algorithms for comparison-based search. In Advances in Neural Information Processing Systems, pages 2231\u20132239, 2011.\n\n[30] Pravin M Vaidya. An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete & Computational Geometry, 4(2):101\u2013115, 1989.\n\n[31] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, pages 1\u20136. IEEE, 2012.\n\n[32] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Computer Vision and Pattern Recognition (CVPR), Jun 2014.\n\n[33] A. Yu and K. Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In International Conference on Computer Vision (ICCV), Oct 2017.", "award": [], "sourceid": 5090, "authors": [{"given_name": "Blake", "family_name": "Mason", "institution": "University of Wisconsin - Madison"}, {"given_name": "Ardhendu", "family_name": "Tripathy", "institution": "University of Wisconsin - Madison"}, {"given_name": "Robert", "family_name": "Nowak", "institution": "University of Wisconsion-Madison"}]}*