{"title": "Beyond Vector Spaces: Compact Data Representation as Differentiable Weighted Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 6906, "page_last": 6916, "abstract": "Learning useful representations is a key ingredient to the success of modern machine learning. Currently, representation learning mostly relies on embedding data into Euclidean space. However, recent work has shown that data in some domains is better modeled by non-euclidean metric spaces, and inappropriate geometry can result in inferior performance. In this paper, we aim to eliminate the inductive bias imposed by the embedding space geometry. Namely, we propose to map data into more general non-vector metric spaces: a weighted graph with a shortest path distance. By design, such graphs can model arbitrary geometry with a proper configuration of edges and weights. Our main contribution is PRODIGE: a method that learns a weighted graph representation of data end-to-end by gradient descent. Greater generality and fewer model assumptions make PRODIGE more powerful than existing embedding-based approaches. We confirm the superiority of our method via extensive experiments on a wide range of tasks, including classification, compression, and collaborative filtering.", "full_text": "Beyond Vector Spaces: Compact Data Representation\n\nas Differentiable Weighted Graphs\n\nDenis Mazur\u2217\n\nYandex\n\ndenismazur@yandex-team.ru\n\nVage Egiazarian\u2217\n\nSkoltech\n\nVage.egiazarian@skoltech.ru\n\nStanislav Morozov\u2217\n\nYandex\n\nLomonosov Moscow State University\n\nstanis-morozov@yandex.ru\n\nArtem Babenko\n\nYandex\n\nNational Research University\nHigher School of Economics\n\nartem.babenko@phystech.edu\n\nAbstract\n\nLearning useful representations is a key ingredient to the success of modern ma-\nchine learning. Currently, representation learning mostly relies on embedding data\ninto Euclidean space. 
However, recent work has shown that data in some domains\nis better modeled by non-euclidean metric spaces, and inappropriate geometry can\nresult in inferior performance. In this paper, we aim to eliminate the inductive\nbias imposed by the embedding space geometry. Namely, we propose to map\ndata into more general non-vector metric spaces: a weighted graph with a shortest\npath distance. By design, such graphs can model arbitrary geometry with a proper\ncon\ufb01guration of edges and weights. Our main contribution is PRODIGE: a method\nthat learns a weighted graph representation of data end-to-end by gradient descent.\nGreater generality and fewer model assumptions make PRODIGE more powerful\nthan existing embedding-based approaches. We con\ufb01rm the superiority of our\nmethod via extensive experiments on a wide range of tasks, including classi\ufb01cation,\ncompression, and collaborative \ufb01ltering.\n\n1\n\nIntroduction\n\nNowadays, representation learning is a major component of most data analysis systems; this compo-\nnent aims to capture the essential information in the data and represent it in a form that is useful for the\ntask at hand. Typical examples include word embeddings [1, 2, 3], image embeddings [4, 5], user/item\nrepresentations in recommender systems [6] and others. To be useful in practice, representation\nlearning methods should meet two main requirements. Firstly, they should be effective, i.e., they\nshould not lose the information needed to achieve high performance in the speci\ufb01c task. Secondly,\nthe constructed representations should be ef\ufb01cient, e.g., have small dimensionality, sparsity, or other\nstructural constraints, imposed by the particular machine learning pipeline. 
In this paper, we focus\non compact data representations, i.e., we aim to achieve the highest performance with the smallest\nmemory consumption.\nMost existing methods represent data items as points in some vector space, with the Euclidean space\nRn being a \"default\" choice. However, several recent works [7, 8, 9] have demonstrated that the\nquality of representation is heavily in\ufb02uenced by the geometry of the embedding space. In particular,\ndifferent space curvature can be more appropriate for different types of data [8]. While some prior\n\n\u2217Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fworks aim to learn the geometrical properties of the embedding space from data[8], all of them\nassume a vectorial representation of the data, which may be an unnecessary inductive bias.\nIn contrast, we investigate a more general paradigm by proposing to represent \ufb01nite datasets as\nweighted graphs equipped with a shortest path distance. It can be shown that such graphs can represent\nany geometry for a \ufb01nite dataset[10] and can be naturally treated as \ufb01nite metric spaces. Speci\ufb01cally,\nwe introduce Probabilistic Differentiable Graph Embeddings (PRODIGE), a method that learns\na weighted graph from data, minimizing the problem-speci\ufb01c objective function via gradient descent.\nUnlike existing methods, PRODIGE does not learn vectorial embeddings of the data points, instead\ninformation from the data is effectively stored in the graph structure. Via extensive experiments\non several different tasks, we con\ufb01rm that, in terms of memory consumption, PRODIGE is more\nef\ufb01cient than its vectorial counterparts.\nTo sum up, the contributions of this paper are as follows:\n\n1. We propose a new paradigm for constructing representations of \ufb01nite datasets. 
Instead of\nconstructing pointwise vector representations, our proposed PRODIGE method represents\ndata as a weighted graph equipped with a shortest path distance.\n\n2. Applied to several tasks, PRODIGE is shown to outperform vectorial embeddings when given\nthe same memory budget; this con\ufb01rms our claim that PRODIGE can capture information\nfrom the data more effectively.\n\n3. The PyTorch source code of PRODIGE is available online1.\n\nThe rest of the paper is organized as follows. We discuss the relevant prior work in Section 2 and\ndescribe the general design of PRODIGE in Section 3. In Section 4 we consider several practical\ntasks and demonstrate how they can bene\ufb01t from the usage of PRODIGE as a drop-in replacement\nfor existing vectorial representation methods. Section 5 concludes the paper.\n\n2 Related work\n\nIn this section, we brie\ufb02y review relevant prior work and describe how the proposed approach relates\nto the existing machine learning concepts.\nEmbeddings. Vectorial representations of data have been used in machine learning systems for\ndecades. The \ufb01rst representations were mostly hand-engineered and did not involve any learning,\ne.g. SIFT[11] and GIST[12] representations for images, and n-gram frequencies for texts. The recent\nsuccess of machine learning is largely due to the transition from the handcrafted to learnable data\nrepresentations in domains, such as NLP[1, 2, 3], vision[4], speech[13]. Most applications now use\nlearnable representations, which are typically vectorial embeddings in Euclidean space.\nEmbedding space geometry. It has recently been shown that Euclidean space is a suboptimal\nmodel for several data domains [7, 8, 9]. In particular, spaces with a hyperbolic geometry are\nmore appropriate to represent data with a latent hierarchical structure, and several papers investigate\nhyperbolic embeddings for different applications [14, 15, 16]. 
The current consensus appears to be\nthat different geometries are appropriate for solving different problems and there have been attempts\nto learn these geometrical properties (e.g. curvature) from data [8]. Instead, our method aims to\nrepresent data as a weighted graph with a shortest path distance; this by design can express an\narbitrary \ufb01nite metric space.\nConnections to graph embeddings. Representing the vertices of a given graph as vectors is a\nlong-standing problem in machine learning and complex networks communities. Modern approaches\nto this problem rely on graph theory and/or graph neural networks [17, 18], which are both areas of\nactive research. In some sense, we aim to solve the inverse problem; given data entities, our goal is to\nlearn a graph approximating the distances that satisfy the requirements of the task at hand.\nExisting works on graph learning. We are not aware of prior work that proposes a general method to\nrepresent data as a weighted graph for any differentiable objective function. Probably, the closest work\nto ours is [19]; they solve the problem of distance-preserving compression via a dual optimization\nproblem. Their proposed approach lacks end-to-end learning and does not generalize to arbitrary\nloss functions. There are also several approaches that learn graphs from data for speci\ufb01c problems.\n\n1https://github.com/stanis-morozov/prodige\n\n2\n\n\fSome studies[20, 21] learn specialized graphs that perform clustering or semi-supervised learning.\nOthers[22, 23] focus speci\ufb01cally on learning probabilistic graphical model structure. Most of the\nproposed approaches are highly problem-speci\ufb01c and do not scale well to large graphs.\n\n3 Method\n\nIn this section we describe the general design of PRODIGE and its training procedure.\n\n3.1 Learnable Graph Metric Spaces\nConsider an undirected weighted graph G(V, E, w), where V ={v0, v1, . . . 
, vn} is a set of vertices,\ncorresponding to data items, and E={e0, e1, . . . , em} is a set of edges, ei = e(v^source_i, v^target_i). We\nuse wθ(ei) to denote the non-negative edge weights, which are learnable parameters of our model.\nOur model includes another set of learnable parameters that correspond to the probabilities of edges\nin the graph. Specifically, for each edge ei, we define a Bernoulli random variable bi ∼ pθ(bi) that\nindicates whether the edge is present in the graph G. For simplicity, we assume that all random\nvariables bi in our model are independent. In this case the joint probability of all edges in the graph\ncan be written as p(G) = ∏_{i=0}^{m} pθ(bi).\nThe expected distance between any two vertices vi and vj can now be expressed as the expected sum of edge\nweights along the shortest path:\n\nE_{G∼p(G)} dG(vi, vj) = E_{G∼p(G)} min_{π∈ΠG(vi,vj)} ∑_{ei∈π} wθ(ei)    (1)\n\nHere ΠG(vi, vj) denotes the set of all paths from vi to vj over the edges of G, or more formally,\nΠG(vi, vj) = {π : (e(vi, v...), . . . , e(v..., vj))}. For a given G, the shortest path can be found exactly\nusing Dijkstra's algorithm. Generally speaking, a shortest path is not guaranteed to exist, e.g., if G is\ndisconnected; in this case we define dG(vi, vj) to be equal to a sufficiently large constant.\nThe parameters of our model must satisfy two constraints: wθ(ei) ≥ 0 for the weights and\n0 ≤ pθ(bi) ≤ 1 for the probabilities. We avoid constrained optimization by defining wθ(ei) as\nsoftplus(θw,i)[24] and pθ(bi) as σ(θb,i).\nThis model can be trained by minimizing an arbitrary differentiable objective function with respect to\nthe parameters θ = {θw, θb} directly by gradient descent. 
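The stochastic-graph model above can be sketched in a few lines. This is a minimal stdlib-only illustration, not the authors' released PyTorch implementation; all function and variable names here are made up for the example. It samples graphs from p(G) using sigmoid edge probabilities and softplus weights, and Monte-Carlo-estimates the expected shortest-path distance of Eq. (1):

```python
import heapq
import math
import random

# Illustrative sketch of Section 3.1 (not the authors' code): each edge holds
# two scalars, theta_w (weight parameter, mapped through softplus) and
# theta_b (presence logit, mapped through a sigmoid to p(b_i = 1)).

BIG = 1e9  # d_G(v_i, v_j) when no path exists in a sampled graph


def softplus(x):
    return math.log1p(math.exp(x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dijkstra(n, edges, src, dst):
    """Exact shortest path on one sampled graph; edges: {(u, v): weight}."""
    adj = {i: [] for i in range(n)}
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        adj[v].append((u, w))  # the graph is undirected
    dist = [BIG] * n
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return BIG


def expected_distance(n, theta, src, dst, n_samples=100, seed=0):
    """Monte-Carlo estimate of E_{G ~ p(G)} d_G(src, dst), as in Eq. (1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        sampled = {e: softplus(tw) for e, (tw, tb) in theta.items()
                   if rng.random() < sigmoid(tb)}
        total += dijkstra(n, sampled, src, dst)
    return total / n_samples


# Toy 3-vertex graph: edges (0,1) and (1,2) are almost surely present,
# the direct edge (0,2) is almost surely absent.
theta = {(0, 1): (1.0, 20.0), (1, 2): (1.0, 20.0), (0, 2): (5.0, -20.0)}
d = expected_distance(3, theta, 0, 2)  # close to 2 * softplus(1.0)
```

In a real training loop the sampling and the loss would be differentiable through the weights, with the edge probabilities handled via the log-derivative trick discussed in Section 3.2.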
We explore several problem-specific objective\nfunctions in Section 4.\n\n3.2 Sparsity\n\nIn order to learn a compact representation, we encourage the algorithm to learn sparse graphs by\nimposing additional regularization on p(G). Namely, we employ a recent sparsification technique\nproposed in [25, 26]. Denoting the task-specific loss function as L(G, θ), the regularized training\nobjective can be written as:\n\nR(θ) = E_{G∼p(G)}[L(G, θ)] + λ · (1/|E|) ∑_{i=1}^{|E|} pθ(bi = 1)    (2)\n\nIntuitively, the term (1/|E|) ∑_{i=1}^{|E|} pθ(bi = 1) penalizes the number of edges being used, on average. It\neffectively encourages sparsity by forcing the edge probabilities to decay over time. For instance, if a\ncertain edge has no effect on the main objective L(G, θ) (e.g., the edge never occurs on any shortest\npath), the optimal probability of that edge being present is exactly zero. The regularization coefficient\nλ affects the \u201czeal\u201d of edge pruning, with larger values of λ corresponding to greater sparsity of the\nlearned graph.\n\n3\n\n\fWe minimize this regularized objective function (2) by stochastic gradient descent. The gradients\n∇θw R are straightforward to compute using existing autograd packages, such as TensorFlow or\nPyTorch. The gradients ∇θb R are, however, more tricky and require the log-derivative trick[27]:\n\n∇θb R = E_{G∼p(G)}[L(G, θ) · ∇θb log p(G)] + λ · (1/|E|) ∑_{i=1}^{|E|} ∇θb pθ(bi = 1)    (3)\n\nIn practice, we can use a Monte-Carlo estimate of the gradient (3). However, the variance of this estimate\ncan be too large to be used in practical optimization. To reduce the variance, we use the fact that the\noptimal path usually contains only a tiny subset of all possible edges. 
More formally, if the objective\nfunction only depends on a subgraph Ĝ ⊆ G, i.e. L(Ĝ, θ) = L(G, θ), then we can integrate out all\nedges from G \\ Ĝ:\n\nE_{G∼p(G)} L(G, θ) · ∇θb log p(G) = E_{Ĝ∼p(Ĝ)} L(Ĝ, θ) · ∇θb log p(Ĝ)    (4)\n\nThe expression (4) allows for an efficient training procedure that only samples the edges that are required\nby Dijkstra's algorithm. More specifically, on every iteration the path-finding algorithm selects a\nvertex vi and explores its neighbors by sampling bi ∼ pθ(bi) for the edges that are\npotentially connected to vi. In this case, the size of Ĝ is proportional to the number of iterations of\nDijkstra's algorithm, which is typically lower than the number of vertices in the original graph G.\nFinally, once the training procedure converges, most edges in the obtained graph are nearly determin-\nistic: pθ(bi = 1) < ε ∨ pθ(bi = 1) > 1 − ε. We make the graph exactly deterministic by keeping\nonly the edges with probability greater than or equal to 0.5.\n\n3.3 Scalability\nAs the total number of edges |E| in a complete graph grows quadratically with the number of vertices\n|V|, memory consumption during PRODIGE training on large datasets can be infeasible. To reduce\nmemory requirements, we explicitly restrict the set of possible edges to a small subset and learn probabilities only for\nedges from this subset. This subset is constructed by the simple heuristic described\nbelow.\nFirst, we add an edge from each data item to the k most similar items in terms of a problem-specific\nsimilarity in the original data space. Second, we also add r random edges between uniformly chosen\nsource and destination vertices. Overall, the size of the constructed subset scales linearly with the\nnumber of vertices |V|, which makes training feasible for large-scale datasets. 
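The candidate-edge heuristic of Section 3.3 can be sketched as follows. This is illustrative stdlib-only code under the stated assumptions (brute-force neighbour search, a caller-supplied distance function), not the authors' implementation:

```python
import random

# Illustrative sketch (not the authors' code) of the Section 3.3 heuristic:
# connect each vertex to its k most similar items, then add r random edges
# per vertex, giving O(|V| * (k + r)) candidates instead of the O(|V|^2)
# edges of a complete graph.


def candidate_edges(points, k, r, dist, seed=0):
    rng = random.Random(seed)
    n = len(points)
    edges = set()
    for i in range(n):
        # k nearest neighbours of vertex i (brute force, for clarity only)
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: dist(points[i], points[j]))[:k]
        for j in neighbours:
            edges.add((min(i, j), max(i, j)))
        # r random long-range edges from vertex i
        for _ in range(r):
            j = rng.randrange(n)
            if j != i:
                edges.add((min(i, j), max(i, j)))
    return edges


points = [(float(x), 0.0) for x in range(50)]
edges = candidate_edges(points, k=3, r=2, dist=lambda a, b: abs(a[0] - b[0]))
# Linear growth: at most 50 * (3 + 2) = 250 candidate edges, versus the
# 50 * 49 / 2 = 1225 edges of a complete graph on 50 vertices.
```

The random long-range edges matter in practice: they give the pruning procedure a chance to keep shortcut edges that a pure nearest-neighbour graph would never contain.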
In our experiments,\nwe observe that increasing the number of possible edges improves the model performance up to some\nsaturation point, typically at 32−100 edges per vertex.\nFinally, we observed that the choice of optimization algorithm is crucial for the convergence speed\nof our method. In particular, we use SparseAdam[28], as it significantly outperformed other sparse\nSGD alternatives in our experiments.\n\n4 Applications\n\nIn this section, we experimentally evaluate the proposed PRODIGE model on several practical tasks\nand compare it to task-specific baselines.\n\n4.1 Distance-preserving compression\n\nDistance-preserving compression learns a compact representation of high-dimensional data that pre-\nserves the pairwise distances from the original high-dimensional space. The learned representations\ncan then be used in practical applications, e.g., data visualization.\n\n4\n\n\fObjective. We optimize the squared error between pairwise distances in the original and compressed\nspaces. Since PRODIGE graphs are stochastic, we minimize this objective in expectation over the edge\nprobabilities:\n\nE_{G∼p(G)} L(G, θ) = E_{G∼p(G)} (1/N^2) ∑_{i=0}^{N} ∑_{j=0}^{N} (||xi − xj||_2 − dG(vi, vj))^2    (5)\n\nIn the formula above, xi, xj ∈ X are objects in the original high-dimensional Euclidean space and\nvi, vj ∈ V are the corresponding graph vertices. Note that the objective (5) resembles the well-known\nMultidimensional Scaling algorithm[29].\nHowever, the objective (5) does not account for the size of the learned graph. This would likely\nlead to a trivial solution, since the PRODIGE graph could simply memorize wθ(e(vi, vj)) = ||xi − xj||_2.\nTherefore, we use the sparsification technique described earlier in Section 3.2. We also employ the\ninitialization heuristic from Section 3.3 to speed up training. 
Namely, we start with 64 edges per\nvertex, half of which are links to the nearest neighbors and the other half are random edges.\nExperimental setup. We experiment with three publicly available datasets:\n\n• MNIST10k: N = 10^4 images from the test set of the MNIST dataset, represented as 784-dimensional vectors;\n• GLOVE10k: the top-N = 10^4 most frequent words, represented as 300-dimensional pre-trained2 GloVe[2] vectors;\n• CelebA10K: 128-dimensional embeddings of N = 10^4 random face photographs from the CelebA dataset, produced by a deep CNN3.\n\nIn these experiments, we aim to preserve the Euclidean distances (5) for all datasets. Note, however,\nthat any distance function can be used in PRODIGE.\n\nMethod | #params/instance | #params total | MNIST10k | GLOVE10k | CelebA10k\n≤ 4 parameters per instance\nPRODIGE | 3.92 ± 0.02 | 39.2k | 0.00938 | 0.03289 | 0.00374\nMDS | 4 | 40k | 0.05414 | 0.13142 | 0.01678\nPoincare MDS | 4 | 40k | 0.04683 | 0.11386 | 0.01649\nPCA | 4 | 40k | 0.30418 | 0.84912 | 0.09078\n≤ 8 parameters per instance\nPRODIGE | 7.65 ± 0.14 | 76.5k | 0.00886 | 0.02856 | 0.00367\nMDS | 8 | 80k | 0.01857 | 0.05584 | 0.00621\nPoincare MDS | 8 | 80k | 0.01503 | 0.04839 | 0.00619\nPCA | 8 | 80k | 0.16237 | 0.62424 | 0.05298\nTable 1: Comparison of distance-preserving compression methods for two memory budgets. We\nreport the mean squared error between pairwise distances in the original space and for the learned\nrepresentations.\n\nWe compare our method with three baselines, which construct vectorial embeddings:\n\n• Multidimensional Scaling (MDS)[29] is a well-known visualization method that mini-\nmizes a similar distance-preserving objective (5), but maps datapoints into a Euclidean\nspace of small dimensionality.\n• Poincare MDS is a version of MDS that maps data into the Poincare ball. This method approx-\nimates the original distance with the hyperbolic distance between learned vector embeddings:\n\ndh(xi, xj) = arccosh(1 + 2 ||xi − xj||_2^2 / ((1 − ||xi||_2^2) · (1 − ||xj||_2^2)))\n\n2We used a pre-trained gensim model \"glove-wiki-gigaword-300\"; see https://radimrehurek.com/\ngensim/downloader.html for details\n\n3The dataset was obtained by running the face_recognition package on CelebA images and uniformly sampling\n10^4 face embeddings, see https://github.com/ageitgey/face_recognition\n\n5\n\n\f• PCA. Principal Component Analysis is the most popular technique for data compression.\nWe include this method as a sanity check.\n\nFor all the methods, we compare the performance given the same memory budget. Namely, we\ninvestigate two operating points, corresponding to four and eight 32-bit numbers per data item.\nFor embedding-based baselines, this corresponds to 4-dimensional and 8-dimensional embeddings,\nrespectively. As for PRODIGE, it requires a total of N + 2|E| parameters, where N = |V| is the number\nof objects and |E| is the number of edges with pθ(bi) ≥ 0.5. The learned graph is represented as a\nsequence of edges ordered by the source vertex, and each edge is stored as a tuple of target vertex\n(int32) and weight (float32). We tune the regularization coefficient λ to achieve an overall memory\nconsumption close to the considered operating points. The distance approximation errors for all\nmethods are reported in Table 1, which illustrates that PRODIGE outperforms the embedding-based\ncounterparts by a large margin. These results confirm that the proposed graph representations are\nmore efficient in capturing the underlying data geometry compared to vectorial embeddings. 
We\nalso verify the robustness of our training procedure by running several experiments with different\nrandom initializations and different initial numbers of neighbors. Figure 2 shows the learning curves\nof PRODIGE under various conditions for the GLOVE10K dataset and the four-numbers-per-vertex budget.\nWhile these results exhibit some variability due to optimization stochasticity, the overall training\nprocedure is stable and robust.\nQualitative results. To illustrate the graphs obtained by learning the PRODIGE model, we train\nit on a toy dataset containing 100 randomly sampled MNIST images of five classes. We start the\ntraining from a full graph, which contains 4950 edges, and increase λ until only\n5% of the edges are preserved. The tSNE[30] plot of the obtained graph, based on dG(·,·) distances, is\nshown in Figure 1.\n\nFigure 1: Trained PRODIGE model for a small subset of the MNIST dataset: the full graph (left) and a\nzoom-in of two clusters (right). Vertex positions were computed using tSNE[30] over dG(·,·) dis-\ntances; colors represent class labels. View interactively: https://tinyurl.com/prodige-graph\n\nFigure 1 reveals several properties of the PRODIGE graphs. First, the number of edges per vertex is\nvery uneven, with a large fraction of edges belonging to a few high-degree vertices. We assume that\nthese \"popular\" vertices play the role of \"traffic hubs\" for pathfinding in the graph. The shortest path\nbetween distant vertices is likely to begin by reaching the nearest \"hub\", then travel over the \"long\"\nedges to the hub that \"contains\" the target vertex, after which it would follow the short edges to reach\nthe target itself.\nAnother important observation is that non-hub vertices tend to have only a few edges. We conjecture\nthat this is the key to the memory-efficiency of our method. 
Effectively, the PRODIGE graph\nrepresents most of the data items by their relation to one or several \"landmark\" vertices, corresponding\nto graph hubs. Interestingly, this representation resembles the topology of human-made transport\nnetworks, with few dedicated hubs and a large number of peripheral nodes. We plot the distribution\nof vertex degrees in Figure 3; it resembles a power-law distribution, demonstrating the \"scale-free\"\nproperty of PRODIGE graphs.\n\n6\n\n\fFigure 2: Learning curves; standard deviation over five runs shown in pale.\n\nFigure 3: Vertex degree histogram for GLOVE10K with eight params/vertex.\n\nSanity check. In this experiment we also verify that PRODIGE is able to reconstruct graphs from\ndatapoints whose mutual distances are obtained as shortest paths in some graph. Namely, we generate\nconnected Erdős–Rényi graphs with 10-25 vertices, with edge probability p=0.25 and edge weights\nsampled from the uniform U(0, 1) distribution. We then train PRODIGE to reconstruct these graphs\ngiven the pairwise distances between vertices. Out of 100 random graphs, in 91 cases PRODIGE was\nable to reconstruct all edges that affected shortest paths, and in all 100 runs the resulting graph had a\ndistance approximation error below 10^−3.\n\n4.2 Collaborative Filtering\n\nIn the next series of experiments, we investigate the usage of PRODIGE in the collaborative filtering\ntask, which is a key component of modern recommendation systems.\nConsider a sparse binary user-item matrix F ∈ {0, 1}^{m×n} that represents the preferences of m users\nover n items. Fij=1 means that the j-th item is relevant for the i-th user. Fij=0 means that the j-th\nitem is not relevant for the i-th user or the relevance information is absent. The recommendation task\nis to extract the most relevant items for a particular user. A common performance measure for this\ntask is HitRatio@k. 
This metric measures how frequently there is at least one relevant item among the\ntop-k recommendations suggested by the algorithm. Note that in these experiments we consider the\nsimplest setting, where user preferences are binary. All experiments are performed on the Pinterest\ndataset[31].\nObjective. Intuitively, we want our model to rank the relevant items above the non-relevant ones.\nWe achieve this by maximizing the following objective:\n\nE_{G∼p(G)} L(G, θ) = E_{G∼p(G)} E_{ui,x+,x−} log [ e^{−dG(ui,x+)} / (e^{−dG(ui,x+)} + ∑_{x−} e^{−dG(ui,x−)}) ]    (6)\n\nFor each user ui, this objective enforces the positive items x+ to be closer to ui in terms of dG(·,·) than the\nnegative items x−. We sample the positive items x+ uniformly from the training items that are relevant\nto ui. In turn, the x− are sampled uniformly from all items. In practice, we only need to run Dijkstra's\nalgorithm once per user ui to calculate the distances to all positive and negative items.\nSimilarly to the previous section, we speed up the training stage by considering only a small subset\nof edges. 
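For one user, one positive item, and a set of sampled negatives on a single sampled graph, the objective (6) reduces to a softmax cross-entropy over negated distances. A minimal stdlib-only sketch (names are illustrative, not from the released code):

```python
import math

# Sketch of one term of the ranking objective (6), for a single user with
# precomputed graph distances to the positive and negative items
# (illustrative; not the authors' implementation).


def ranking_loss(d_pos, d_negs):
    """Negative log-softmax over exp(-distance), i.e. one term of Eq. (6)."""
    num = math.exp(-d_pos)
    den = num + sum(math.exp(-d) for d in d_negs)
    return -math.log(num / den)


close = ranking_loss(d_pos=0.5, d_negs=[3.0, 4.0])  # positive is nearest
far = ranking_loss(d_pos=4.0, d_negs=[0.5, 1.0])    # positive is farthest
# Maximizing the objective = minimizing this loss, so training pulls
# relevant items closer to the user than the sampled negatives.
```

Because all distances for one user come from a single run of Dijkstra's algorithm, the per-user cost of this loss is dominated by one shortest-path computation, as noted above.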
Namely, we build an initial graph using F as the adjacency matrix and add extra edges that\nconnect the nearest users (user-user edges) and the nearest items (item-item edges) in terms of the cosine\ndistance between the corresponding rows/columns of F.\nFor this task, we compare the following methods:\n\n• PRODIGE-normal: the PRODIGE method as described above; we restrict the set of possible\nedges to include 16 user-user and item-item edges and all relevant user-item edges available\nin the training data;\n\n7\n\n\f• PRODIGE-bipartite: a version of PRODIGE that is allowed to learn only edges that\nconnect users to items, counting approximately 30 edges per item. The user-user and the\nitem-item edges are prohibited;\n• PRODIGE-random: a version of our model that is initialized with completely random\nedges. All user-user, item-item, and user-item edges are sampled uniformly. In total, we\nsample 50 edges per user and item;\n• SVD: truncated Singular Value Decomposition of the user-item matrix4;\n• ALS: the Alternating Least Squares method for implicit feedback[32];\n• Metric Learning: a metric learning approach that learns the user/item embeddings in\nEuclidean space and optimizes the same objective function (6) with the Euclidean distance\nbetween user and item embeddings; all other training parameters are shared with PRODIGE.\n\nThe comparison results for two memory budgets are presented in Table 2. It illustrates that our\nPRODIGE model achieves better overall recommendation quality compared to the embedding-based\nmethods. Interestingly, the bipartite graph performs nearly as well as the task-specific heuristic with\nnearest edges. 
In contrast, starting from a random set of edges results in poor performance, which\nindicates the importance of the initial choice of edges.\n\nMethod | PRODIGE-normal | PRODIGE-bipartite | PRODIGE-random | SVD | ALS | Metric Learning\n≤ 4 parameters per user/item\nHR@5 | 0.50213 | 0.48533 | 0.35587 | 0.33365 | 0.37005 | 0.45898\nHR@10 | 0.66192 | 0.64250 | 0.57492 | 0.49619 | 0.51815 | 0.60079\n≤ 8 parameters per user/item\nHR@5 | 0.57659 | 0.50113 | 0.5107 | 0.48617 | 0.54728 | 0.59921\nHR@10 | 0.77980 | 0.73485 | 0.70489 | 0.70075 | 0.74891 | 0.79021\nTable 2: HitRate@k for the Pinterest dataset for different methods and two memory budgets.\n\n4.3 Sentiment classification\n\nAs another application, we consider a simple text classification problem: the algorithm takes a\nsequence of tokens (words) (x0, x1, ..., xT ) as input and predicts a single class label y. This problem\narises in a wide range of tasks, such as sentiment classification or spam detection.\nIn this experiment, we explore the potential applications of learnable weighted graphs as an interme-\ndiate data representation within a multi-layer model. Our goal here is to learn graph edges end-to-end\nusing the gradients from the subsequent layers. To make the whole pipeline fully differentiable, we\ndesign a projection of data items, represented as graph vertices, into feature vectors digestible by\nsubsequent convolutional or dense layers.\nNamely, a data item vi is represented as a vector of distances to K predefined \"anchor\" vertices:\n\nembG(vi) = ⟨dG(vi, v^a_0), dG(vi, v^a_1), . . . , dG(vi, v^a_K)⟩    (7)\n\nIn practice, we add the \"anchor\" vertices {v^a_0, . . . , v^a_K} to the graph before training and connect them\nto random vertices. Note that the \"anchor\" vertices do not correspond to any data object. 
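The anchor-distance featurisation of Eq. (7) can be sketched as follows. This is an illustrative stdlib-only example: the toy graph, vertex ids, and anchor placement below are made up, and a real model would differentiate through the edge parameters rather than use fixed weights:

```python
import heapq
import math

# Sketch of Eq. (7) (illustrative, not the authors' code): each vertex is
# mapped to the vector of its shortest-path distances to K dedicated anchor
# vertices, which later layers consume in place of a learned embedding table.


def shortest_from(adj, s):
    """Dijkstra distances from s; adj: {v: [(neighbour, weight), ...]}."""
    dist = {v: math.inf for v in adj}
    dist[s] = 0.0
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist


def anchor_embeddings(adj, anchors):
    """emb_G(v) = <d_G(v, a_0), ..., d_G(v, a_{K-1})> for every vertex v."""
    per_anchor = [shortest_from(adj, a) for a in anchors]
    return {v: [d[v] for d in per_anchor] for v in adj}


# Path graph 0-1-2 with two anchors (vertices 3 and 4) attached at the ends.
adj = {
    0: [(1, 1.0), (3, 0.5)],
    1: [(0, 1.0), (2, 1.0)],
    2: [(1, 1.0), (4, 0.5)],
    3: [(0, 0.5)],
    4: [(2, 0.5)],
}
emb = anchor_embeddings(adj, anchors=[3, 4])  # emb[1] == [1.5, 1.5]
```

Only K shortest-path computations are needed to embed every vertex at once, which is what makes this projection cheap enough to sit inside a larger differentiable pipeline.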
The\narchitecture used in this experiment is schematically presented in Figure 4.\nIntuitively, the usage of PRODIGE in the architecture from Figure 4 can be treated as a generalization of\na vectorial embedding layer. For instance, if a graph contains only the edges between vertices and an-\nchors, this is equivalent to the embedding emb(vi) = ⟨wθ(e(vi, v^a_0)), wθ(e(vi, v^a_1)), . . . , wθ(e(vi, v^a_K))⟩\nwith O(n · K) trainable parameters. However, in practice, our regularized PRODIGE model learns a\nmore compact graph by using vertex-vertex edges to encode words via their relation to other words.\nModel and objective. We train a simple sentiment classification model with four consecutive layers:\nan embedding layer, a one-dimensional convolutional layer with 32 output filters, followed by a global\nmax pooling layer, a ReLU nonlinearity and a final dense layer that predicts class logits. Indeed, this\nmodel is smaller than the state-of-the-art models for sentiment classification and should be considered\n\n4we use the Implicit package for both ALS and SVD baselines https://github.com/benfred/implicit\n\n8\n\n\fFigure 4: Model architecture for the sentiment classification problem. The PRODIGE graph is used\nas an alternative to the standard embedding layer, followed by a simple convolutional architecture.\n\nonly as a proof of concept. We minimize the cross-entropy objective by backpropagation, computing\ngradients w.r.t. all trainable parameters, including the graph edges {θw, θb}.\nExperimental setup. We evaluate our model on the IMDB benchmark [33], a popular dataset for\nbinary text sentiment classification. The data is split into training and test sets, each containing\nN = 25,000 text instances. 
For simplicity, we only consider the M = 32,000 most frequent tokens.\nWe compare our model with embedding-based baselines, which follow the same architecture from\nFigure 4, but with the standard embedding layer instead of PRODIGE. All embeddings are initialized\nwith pre-trained word representations and then fine-tuned with the subsequent layers by backpropaga-\ntion. As pretrained embeddings, we use GloVe vectors5 trained on the 25,000 texts from the training set\nand select the vectors corresponding to the M most frequent tokens. In PRODIGE, the model graph is\npre-trained by distance-preserving compression of the GloVe embeddings, as described in Section 4.1.\nIn order to encode the \"anchor\" objects, we explicitly add K synthetic objects to the data by running\nK-means clustering and compress the resulting N + K objects by minimizing the objective (5).\n\nRepresentation | PRODIGE 100d | GloVe 100d | GloVe 18d\nAccuracy | 0.8443 | 0.8483 | 0.8028\nModel size | 2.16 MB | 12.24 MB | 2.20 MB\nTable 3: Evaluation of graph-based and vectorial representations for the sentiment classification task.\nFor each model, we report test accuracy and the total model size.\n\nThe results in Table 3 illustrate that\nPRODIGE learns a model with nearly the same quality as the full 100-dimensional embeddings at a\nmuch smaller size. On the other hand, PRODIGE significantly outperforms its vectorial counterpart\nof the same model size.\n\n5 Discussion\n\nIn this work, we introduce PRODIGE, a new method for constructing representations of finite datasets.\nThe method represents the dataset as a weighted graph with a shortest path distance metric, which is\nable to capture the geometry of any finite metric space. Due to its minimal inductive bias, PRODIGE captures\nthe essential information from data more effectively compared to embedding-based representation\nlearning methods. 
The graphs are learned end-to-end by minimizing any differentiable objective function defined by the target task. We empirically confirm the advantage of PRODIGE on several machine learning problems and publish its source code online.

Acknowledgements: The authors would like to thank David Talbot for his numerous suggestions on paper readability. We would also like to thank the anonymous meta-reviewer for their insightful feedback. Vage Egiazarian was supported by the Russian Science Foundation under Grant 19-41-04109.

^5 We used https://pypi.org/project/gensim/ to train the GloVe model with recommended parameters.

References

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.

[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 2017.

[4] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 2014.

[5] Sumit Chopra, Raia Hadsell, Yann LeCun, et al. Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), 2005.

[6] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 2017.

[7] Maximillian Nickel and Douwe Kiela.
Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, 2017.

[8] Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Ré. Learning mixed-curvature representations in product spaces. 2018.

[9] Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329, 2018.

[10] Ryan Hopkins. Finite metric spaces and their embedding into Lebesgue spaces. 2015.

[11] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[12] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[13] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018.

[14] Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. Poincaré GloVe: Hyperbolic word embeddings. arXiv preprint arXiv:1810.06546, 2018.

[15] Tran Dang Quang Vinh, Yi Tay, Shuai Zhang, Gao Cong, and Xiao-Li Li. Hyperbolic recommender systems. arXiv preprint arXiv:1809.01703, 2018.

[16] Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic image embeddings. arXiv preprint arXiv:1904.02239, 2019.

[17] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

[18] Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. arXiv preprint arXiv:1805.11921, 2018.

[19] Francisco Escolano and Edwin R. Hancock. From points to nodes: Inverse graph embedding through a Lagrangian formulation.
In CAIP, 2011.

[20] Masayuki Karasuyama and Hiroshi Mamitsuka. Adaptive edge weighting for graph-based learning algorithms. Machine Learning, 106:307–335, 2016.

[21] Zhao Kang, Haiqi Pan, Steven C. H. Hoi, and Zenglin Xu. Robust graph learning from noisy data. IEEE Transactions on Cybernetics, 2019.

[22] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. arXiv preprint arXiv:1904.10098, 2019.

[23] Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with no tears: Continuous optimization for structure learning. In NeurIPS, 2018.

[24] Hao Zheng, Zhanlei Yang, Wenju Liu, Jizhong Liang, and Yanpeng Li. Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–4, 2015.

[25] Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu. Training sparse neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 455–462, 2017.

[26] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In 6th International Conference on Learning Representations (ICLR), 2018.

[27] Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 1990.

[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015.

[29] Joseph B. Kruskal. Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29(2):115–129, 1964.

[30] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.
Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[31] Xue Geng, Hanwang Zhang, Jingwen Bian, and Tat-Seng Chua. Learning image and user features for recommendation in social networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4274–4282, 2015.

[32] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pages 263–272, 2008.

[33] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, June 2011.