{"title": "Multiple Relational Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 913, "page_last": 920, "abstract": null, "full_text": " Multiple Relational Embedding\n\n\n\n Roland Memisevic Geoffrey Hinton\n Department of Computer Science Department of Computer Science\n University of Toronto University of Toronto\n roland@cs.toronto.edu hinton@cs.toronto.edu\n\n\n\n\n Abstract\n\n We describe a way of using multiple different types of similarity rela-\n tionship to learn a low-dimensional embedding of a dataset. Our method\n chooses different, possibly overlapping representations of similarity by\n individually reweighting the dimensions of a common underlying latent\n space. When applied to a single similarity relation that is based on Eu-\n clidean distances between the input data points, the method reduces to\n simple dimensionality reduction. If additional information is available\n about the dataset or about subsets of it, we can use this information to\n clean up or otherwise improve the embedding. We demonstrate the po-\n tential usefulness of this form of semi-supervised dimensionality reduc-\n tion on some simple examples.\n\n\n1 Introduction\n\nFinding a representation for data in a low-dimensional Euclidean space is useful both for\nvisualization and as prelude to other kinds of data analysis. The common goal underly-\ning the many different methods that accomplish this task (such as ISOMAP [1], LLE [2],\nstochastic neighbor embedding [3] and others) is to extract the usually small number of\nfactors that are responsible for the variability in the data. 
In making the underlying factors\nexplicit, these methods help to focus on the kind of variability that is important and provide\nrepresentations that make it easier to interpret and manipulate the data in reasonable ways.\n\nMost dimensionality reduction methods are unsupervised, so there is no way of guiding\nthe method towards modes of variability that are of particular interest to the user. There\nis also no way of providing hints when the true underlying factors are too subtle to be\ndiscovered by optimizing generic criteria such as maximization of modeled variance in\nPCA, or preservation of local geometry in LLE. Both of these difficulties can be alleviated by\nallowing the user to provide more information than just the raw data points or a single set\nof pairwise similarities between data points.\n\nAs an example, consider images of faces. Nonlinear methods have been shown to find\nembeddings that nicely reflect the variability in the data caused by variation in face identity,\npose, position, or lighting effects. However, it is not possible to tell these methods to\nextract a particular single factor for the purpose of, say, intelligent image manipulation or\npose identification, because the extracted factors are intermingled and may be represented\nsimultaneously across all latent space dimensions.\n\nHere, we consider the problem of learning a latent representation for data based on knowledge\nthat is provided by a user in the form of several different similarity relations. Our\nmethod, multiple relational embedding (MRE), finds an embedding that uses a single la-\ntent data representation, but weights the available latent space dimensions differently to\nallow the latent space to model the multiple different similarity relations. 
By labeling a\nsubset of the data according to the kind of variability one is interested in, one can en-\ncourage the model to reserve a subset of the latent dimensions for this kind of variability.\nThe model, in turn, returns a \"handle\" to that latent space in the form of a corresponding\nlearned latent space metric. Like stochastic neighbor embedding, MRE can also be derived\nas a simplification of Linear Relational Embedding[4].\n\n\n\n1.1 Related work\n\n\nThe problem of supplementing methods for unsupervised learning with \"side-information\"\nin order to influence their solutions is not new and many different approaches have been\nsuggested. [5], for example, describes a way to inform a PCA model by encouraging it to\npreserve a user-defined grouping structure; [6] consider the problem of extracting exactly\ntwo different kinds of factors, which they denote \"style\" and \"content\", by using bilinear\nmodels; more recently, [7] and [8] took a quite different approach to informing a model.\nThey suggest pre-processing the input data by learning a metric in input space that makes\nthe data respect user defined grouping constraints.\n\nOur approach differs from these and other methods in two basic ways. First, in all the\nmethods mentioned above, the side-information has to be defined in terms of equivalence\nconstraints. That is, a user needs to define a grouping structure for the input data by in-\nforming the model which data-points belong together. Here, we consider a rather different\napproach, where the side-information can be encoded in the form of similarity relations.\nThis allows arbitrary continuous degrees of freedom to constrain the low-dimensional em-\nbeddings. Second, our model can deal with several, possibly conflicting, kinds of side-\ninformation. MRE dynamically \"allocates\" latent space dimensions to model different\nuser-provided similarity relations. 
So inconsistent relations are modeled in disjoint sub-\nspaces, and consistent relations can share dimensions. This scheme of sharing the dimen-\nsions of a common latent space is reminiscent of the INDSCAL method [9] that has been\npopular in the psychometric literature.\n\nA quite different way to extend unsupervised models has recently been introduced by [10]\nand [11], where the authors propose ways to extract common factors that underlie two\nor more different datasets, with possibly different dimensionalities. While these meth-\nods rely on a supervision signal containing information about correspondences between\ndata-points in different datasets, MRE can be used to discover correspondences between\ndifferent datasets using almost no pre-defined grouping constraints.\n\n\n\n2 Multiple Relational Embedding\n\n\nIn the following we derive MRE as an extension to stochastic neighbor embedding (SNE).\nLet $X$ denote the matrix of latent space elements arranged column-wise, and let $\\sigma^2$ be some\nreal-valued neighborhood variance or \"kernel bandwidth\". SNE finds a low-dimensional\nrepresentation for a set of input data points $y^i$ ($i = 1, \\ldots, N$) by first constructing a simi-\nlarity matrix $P$ with entries\n\n $$P_{ij} := \\frac{\\exp(-\\frac{1}{\\sigma^2} \\|y^i - y^j\\|^2)}{\\sum_k \\exp(-\\frac{1}{\\sigma^2} \\|y^i - y^k\\|^2)} \\qquad (1)$$\n\nand then minimizing (w.r.t. the set of latent space elements $x^i$ ($i = 1, \\ldots, N$)) the mismatch\nbetween $P$ and the corresponding latent similarity matrix $Q(X)$ defined by\n\n $$Q_{ij}(X) := \\frac{\\exp(-\\|x^i - x^j\\|^2)}{\\sum_k \\exp(-\\|x^i - x^k\\|^2)}. \\qquad (2)$$\n\nThe (row-) normalization of both matrices arises from SNE's probabilistic formulation in\nwhich the (i, j)th entry of P and Q is interpreted as the probability that the ith data-point\nwill pick the jth point as its neighbor (in observable and latent space, respectively). 
The\nmismatch is defined as the sum of Kullback-Leibler-divergences between the respective\nrows [3].\n\nOur goal is to extend SNE so that it learns latent data representations that not only approx-\nimate the input space distances well, but also reflect additional characteristics of the input\ndata that one may be interested in. In order to accommodate these additional characteris-\ntics, instead of defining a single similarity-matrix that is based on Euclidean distances in\ndata space, we define several matrices $P^c$ ($c = 1, \\ldots, C$), each of which encodes some\nknown type of similarity of the data. Proximity in the Euclidean data-space is typically one\nof the types of similarity that we use, though it can easily be omitted. The additional types\nof similarity may reflect any information that the user has access to about any subsets of\nthe data provided the information can be expressed as a similarity matrix that is normalized\nover the relevant subset of the data.\n\nAt first sight, a single latent data representation seems to be unsuitable to accommodate the\ndifferent, and possibly incompatible, properties encoded in a set of $P^c$-matrices. Since our\ngoal, however, is to capture possibly overlapping relations, we do use a single latent space\nand in addition we define a linear transformation $R^c$ of the latent space for each of the $C$\ndifferent similarity-types that we provide as input. Note that this is equivalent to measuring\ndistances in latent space using a different Mahalanobis metric for each $c$ corresponding to\nthe matrix $R^{cT} R^c$.\n\nIn order to learn the transformations $R^c$ from the data along with the set of latent represen-\ntations $X$ we consider the loss function\n\n $$E(X) = \\sum_c E^c(X), \\qquad (3)$$\n\nwhere we define\n\n $$E^c(X) := \\frac{1}{N} \\sum_{i,j} P^c_{ij} \\log \\frac{P^c_{ij}}{Q^c_{ij}} \\quad \\text{and} \\quad Q^c_{ij} := Q_{ij}(R^c X). \\qquad (4)$$\n\nNote that in the case of $C = 1$, $R^1 = I$ (and fixed) and $P^1$ defined as in Eq. (1) this\nfunction simplifies to the standard SNE objective function. 
One might consider weighting\nthe contribution of each similarity-type $c$ using some weighting factor. We found that\nthe solutions are rather robust with regard to different sets of weighting factors and weighted all error\ncontributions equally in our experiments.\n\nAs indicated above, here we consider diagonal R-matrices only, which simply amounts to\nusing a rescaling factor for each latent space dimension. By allowing each type of similarity\nto put a different scaling factor on each dimension the model allows similarity relations\nthat \"overlap\" to share dimensions. Completely unrelated or \"orthogonal\" relations can be\nencoded by using disjoint sets of non-zero scaling factors.\n\nThe gradient of $E(X)$ w.r.t. a single latent space element $x^l$ takes a similar form to the\ngradient of the standard SNE objective function and is given by\n\n $$\\frac{\\partial E(X)}{\\partial x^l} = \\frac{2}{N} \\sum_c \\sum_i (P^c_{il} + P^c_{li} - Q^c_{li} - Q^c_{il}) R^{cT} R^c (x^l - x^i), \\qquad (5)$$\n\nand the gradient w.r.t. a single entry of the diagonal of $R^c$ reads\n\n $$\\frac{\\partial E(X)}{\\partial R^c_{ll}} = \\frac{2}{N} R^c_{ll} \\sum_i \\sum_j (P^c_{ij} - Q^c_{ij}) (x^i_l - x^j_l)^2, \\qquad (6)$$\n\nwhere $x^i_l$ denotes the lth component of the ith latent representative.\n\nFigure 1: Embedding of images of rotated objects. Left: SNE, right: MRE. Latent rep-\nresentatives are colored on a gray-scale corresponding to angle of rotation in the original\nimages. The rightmost plots show entries on the diagonals of latent space transformations\n$R^{Eucl}$ and $R^{Class}$.\n\nAs an illustrative example we ran MRE on a set of images from the Columbia object image\nlibrary (COIL) [12]. The dataset contains ($128 \\times 128$)-dimensional gray-scale images of\ndifferent objects that vary only by rotation, i.e. by a single degree of freedom. 
We took\nthree subsets of images depicting toy-cars, where each subset corresponds to one of three\ndifferent kinds of toy-cars, and embedded the first 30 images of each of these subsets in a\nthree-dimensional space. We used two similarity relations: The first, $P^{Eucl}$, corresponds\nto the standard SNE objective; the second, $P^{Class}$, is defined as a block diagonal matrix\nthat contains homogeneous blocks of size $30 \\times 30$ with entries $\\frac{1}{30}$ and models class\nmembership, i.e. we informed the model using the information that images depicting the\nsame object class belong together.\n\nWe also ran standard SNE on the same dataset$^1$. The results are depicted in figure 1. While\nSNE's unsupervised objective to preserve Euclidean distances leads to a representation\nwhere class-membership is intermingled with variability caused by object rotation (left-\nmost plot), in the MRE approximation the contribution of class-membership is factored out\nand represented in a separate dimension (next plot). This is also reflected in the entries\non the diagonal of the corresponding R-matrices, depicted in the two right-most plots.\n$R^{Class}$ is responsible for representing class membership and can do so using just a single\ndimension. $R^{Eucl}$ on the other hand makes use of all dimensions to some degree, reflecting\nthe fact that the overall variability in \"pixel-space\" depends on class-membership, as well\nas on other factors (here mainly rotation). Note that with the variability according to class-\nmembership factored out, the remaining two dimensions capture the rotational degree of\nfreedom very cleanly.\n\n $^1$ For training we set $\\sigma^2$ manually to $5 \\cdot 10^7$ for both SNE and MRE and initialized all entries in $X$\nand the diagonals of all $R^c$ with small normally distributed values. In all experiments we minimized\nthe loss function defined in Eq. (3) using Carl Rasmussen's Matlab function \"minimize\" for 200\niterations (simple gradient descent worked equally well, but was much slower).\n\n\n2.1 Partial information\n\nIn many real world situations there might be side-information available only for a subset\nof the data-points, because labelling a complete dataset could be too expensive or for other\nreasons impossible. A partially labelled dataset can in that case still be used to provide\na hint about the kind of variability that one is interested in. In general, since the corre-\nsponding transformation $R^c$ provides a way to access the latent space that represents the\ndesired similarity-type, a partially labelled dataset can be used to perform a form of super-\nvised feature extraction in which the labelled data is used to specify a kind of feature \"by\nexample\". It is straightforward to modify the model to deal with partially labelled data.\nFor each type of similarity $c$ that is known to hold for a subset containing $N^c$ examples,\nthe corresponding $P^c$-matrix references only this subset of the complete dataset and is thus\nan $N^c \\times N^c$-matrix. To keep the latent space elements not corresponding to this subset\nunaffected by this error contribution, we can define for each $c$ an index set $I^c$ containing\njust the examples referenced by $P^c$ and rewrite the loss for that type of similarity as\n\n $$E^c(X) := \\frac{1}{N^c} \\sum_{i,j \\in I^c} P^c_{ij} \\log \\frac{P^c_{ij}}{Q^c_{ij}}. \\qquad (7)$$\n\n\n\n3 Experiments\n\n3.1 Learning correspondences between image sets\n\nIn extending the experiment described in section 2 we trained MRE to discover correspon-\ndences between sets of images, in this case with different dimensionalities. We picked 20\nsuccessive images from one object of the COIL dataset described above and 28 images\n($112 \\times 92$ pixels) depicting a person under different viewing angles taken from the UMIST\ndataset [13]. 
We chose this data in order to obtain two sets of images that vary in a \"similar\"\nor related way. Note that, because the datasets have different dimensionalities, here it is not\npossible to define a single relation describing Euclidean distance between all data-points.\nInstead we constructed two relations $P^{Coil}$ and $P^{Umist}$ (for both we used Eq. (1) with $\\sigma^2$\nset as in the previous experiment), with corresponding index-sets $I^{Coil}$ and $I^{Umist}$ contain-\ning the indices of the points in each of the two datasets. In addition we constructed one\nclass-membership relation in the same way as before and two identical relations $P^1$ and\n$P^2$ that take the form of a $2 \\times 2$-matrix filled with entries $\\frac{1}{2}$. Each of the corresponding\nindex sets $I^1$ and $I^2$ points to two images (one from each dataset) that represent the end\npoints of the rotational degree of freedom, i.e. to the first and the last points if we sort the\ndata according to rotation (see figure 2, left plot). These similarity types are used to make\nsure that the model properly aligns the representations of the two different datasets. Note\nthat the end points constitute the only supervision signal; we did not use any additional\ninformation about the alignment of the two datasets.\n\nAfter training a two-dimensional embedding$^2$, we randomly picked latent representatives\nof the COIL images and computed reconstructions of corresponding face images using a\nkernel smoother (i.e. as a linear combination of the face images with coefficients based on\nlatent space distances). In order to factor out variability corresponding to class membership\nwe first multiplied all latent representatives by the inverse of $R^{Class}$. (Note that such a\nstrategy will in general blow up the latent space dimensions that do not represent class\nmembership, as the corresponding entries in $R^{Class}$ may contain very small values. 
The\nkernel smoother consequently requires a very large kernel bandwidth, with the net effect\nthat the latent representation effectively collapses in the dimensions that correspond to class\nmembership, which is exactly what we want.) The reconstructions, depicted in the right\nplot of figure 2, show that the model has captured the common mode of variability.\n\n $^2$ Training was done using 500 iterations with a setup as in the previous experiment.\n\nFigure 2: Face reconstructions by alignment. Left: Side-information in form of two image\npairs in correspondence. Right: Reconstructions of face images from randomly chosen cat\nimages.\n\n\n3.2 Supervised feature extraction\n\nTo investigate the ability of MRE to perform a form of \"supervised feature extraction\" we\nused a dataset of synthetic face images that originally appeared in [1]. The face images\nvary according to pose (two degrees of freedom) and according to the position of a lighting\nsource (one degree of freedom). The corresponding low-dimensional parameters are avail-\nable for each data-point. We computed an embedding with the goal of obtaining features\nthat explicitly correspond to these different kinds of variability in the data.\n\nWe labelled a subset of 100 out of the total of 698 data-points with the three mentioned\ndegrees of freedom in the following way: After standardizing the pose and lighting param-\neters so that they were centered and had unit variance, we constructed three corresponding\nsimilarity matrices ($P^{Pose1}$, $P^{Pose2}$, $P^{Lighting}$) for a randomly chosen subset of 100 points\nusing Eq. (1) and the three low-dimensional parameter sets as input data. In addition we\nused a fourth similarity relation $P^{Ink}$, corresponding to overall brightness or \"amount of\nink\", by constructing for each image a corresponding feature equal to the sum of its pixel\nintensities and then defining the similarity matrix as above. 
We set the bandwidth parame-\nter $\\sigma^2$ to 1.0 for all of these similarity-types$^3$. In addition we constructed the standard SNE\nrelation $P^{Eucl}$ (defined for all data-points) using Eq. (1) with $\\sigma^2$ set$^4$ to 100.\n\nWe initialized the model as before and trained for 1000 iterations of 'minimize' to find\nan embedding in a four-dimensional space. Figure 3 (right plot) shows the learned latent\nspace metrics corresponding to the five similarity-types. Obviously, MRE devotes one\ndimension to each of the four similarity-types, reflecting the fact that each of them describes\na single one-dimensional degree of freedom that is barely correlated with the others. Data-\nspace similarities in contrast are represented using all dimensions. The plots on the left\nof figure 3 show the embedding of the 598 unlabelled data-points. The top plot shows the\nembedding in the two dimensions in which the two \"pose\"-metrics take on their maximal\nvalues, the bottom plot shows the dimensions in which the \"lighting\"- and \"ink\"-metric take\non their maximal values. The plots show that MRE generalizes over unlabeled data: In each\ndimension the unlabeled data is clearly arranged according to the corresponding similarity\ntype, and is arranged rather randomly with respect to other similarity types. There are\na few correlations, in particular between the first pose- and the \"ink\"-parameter, that are\ninherent in the dataset, i.e. the data does not vary entirely independently with respect to\nthese parameters. 
These correlations are also reflected in the slightly overlapping latent\nspace weight sets. MRE gets the pose-embedding wrong for a few very dark images that\nare apparently too far away in the data space to be associated with the correct labeled data-\npoints.\n\n $^3$ This is certainly not an optimal choice, but we found the solutions to be rather robust against\nchanges in the bandwidth, and this value worked fine.\n $^4$ See previous footnote.\n\n[Figure 3 panels: scatter plots with axes x1 vs. x4 (top left) and x3 vs. x2 (bottom left); diagonal entries of REucl, RLights, RPose2, RPose1 and RInk over latent dimensions 1 to 4 (right).]\n\nFigure 3: Left: Embedding of face images that were not informed about their low-\ndimensional parameters. For a randomly chosen subset of these (marked with a circle),\nthe original images are shown next to their latent representatives. Right: Entries on the\ndiagonals of five latent space transformations.\n\n\n4 Conclusions\n\nWe introduced a way to embed data in a low-dimensional space using a set of similarity\nrelations. Our experiments indicate that the informed feature extraction that this method fa-\ncilitates will be most useful in cases where conventional dimensionality reduction methods\nfail because of their completely unsupervised nature. Although we derived our approach\nas an extension to SNE, it should be straightforward to apply the same idea to other dimen-\nsionality reduction methods.\n\nAcknowledgements: Roland Memisevic is supported by a Government of Canada Award.\nGeoffrey Hinton is a fellow of CIAR and holds a CRC chair. This research was also sup-\nported by grants from NSERC and CFI.\n\n\nReferences\n\n [1] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. 
A global geometric framework for\n nonlinear dimensionality reduction. Science, pages 2319-2323, 2000.\n\n [2] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding.\n Science, 290, 2000.\n\n [3] Geoffrey Hinton and Sam Roweis. Stochastic neighbor embedding. In Advances in Neural\n Information Processing Systems 15, pages 833-840. MIT Press, 2003.\n\n [4] A. Paccanaro and G. E. Hinton. Learning hierarchical structures with linear relational embed-\n ding. In Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT\n Press.\n\n [5] David Cohn. Informed projections. In Advances in Neural Information Processing Systems 15,\n pages 849-856. MIT Press, 2003.\n\n [6] Joshua B. Tenenbaum and William T. Freeman. Separating style and content with bilinear\n models. Neural Computation, 12(6):1247-1283, 2000.\n\n [7] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learn-\n ing with application to clustering with side-information. In Advances in Neural Information\n Processing Systems 15, pages 505-512. MIT Press, Cambridge, MA, 2003.\n\n [8] Tijl De Bie, Michinari Momma, and Nello Cristianini. Efficiently learning the metric using side-\n information. In Proc. of the 14th International Conference on Algorithmic Learning Theory,\n 2003.\n\n [9] J. Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional\n scaling via an n-way generalization of \"Eckart-Young\" decomposition. Psychometrika, 35(3),\n 1970.\n\n[10] J. H. Ham, D. D. Lee, and L. K. Saul. Learning high dimensional correspondences from low\n dimensional manifolds. In Proceedings of the ICML 2003 Workshop on The Continuum from\n Labeled to Unlabeled Data in Machine Learning and Data Mining, pages 34-41, Washington,\n D.C., 2003.\n\n[11] Jakob J. Verbeek, Sam T. Roweis, and Nikos Vlassis. Non-linear CCA and PCA by alignment of lo-\n cal models. 
In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge,\n MA, 2004.\n\n[12] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (coil-20). Technical\n report, 1996.\n\n[13] Daniel B Graham and Nigel M Allinson. Characterizing virtual eigensignatures for general\n purpose face recognition. 163, 1998.\n\n\f\n", "award": [], "sourceid": 2651, "authors": [{"given_name": "Roland", "family_name": "Memisevic", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}