{"title": "Learning Hierarchical Structures with Linear Relational Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 857, "page_last": 864, "abstract": null, "full_text": "Learning hierarchical structures with Linear Relational Embedding

Alberto Paccanaro and Geoffrey E. Hinton
Gatsby Computational Neuroscience Unit
UCL, 17 Queen Square, London, UK
{alberto,hinton}@gatsby.ucl.ac.uk

Abstract

We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. Its final goal is to be able to generalize, i.e. to infer new instances of these relations among the concepts. On a task involving family relationships we show that LRE can generalize better than any previously published method. We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists.

1 Linear Relational Embedding

Our aim is to take a large set of facts about a domain, expressed as tuples of arbitrary symbols in a simple and rigid syntactic format, and to be able to infer other "common-sense" facts without having any prior knowledge about the domain. Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of a few instances of these relations that hold among the concepts. We want to be able to infer other instances of these relations. For example, if the concepts are the people in a certain family, the relations are kinship relations, and we are given the facts "Alberto has-father Pietro" and "Pietro has-brother Giovanni", we would like to be able to infer "Alberto has-uncle Giovanni". 
Our approach is to learn appropriate distributed representations of the entities in the data, and then exploit the generalization properties of distributed representations [2] to make the inferences. In this paper we present a method, which we have called Linear Relational Embedding (LRE), which learns a distributed representation for the concepts by embedding them in a space where the relations between concepts are linear transformations of their distributed representations.

Let us consider the case in which all the relations are binary, i.e. involve two concepts. In this case our data consists of triplets (concept1, relation, concept2), and the problem we are trying to solve is to infer missing triplets when we are given only a few of them. Inferring a triplet is equivalent to being able to complete it, that is, to come up with one of its elements given the other two. Here we shall always try to complete the third element of the triplets.1 LRE will then represent each concept in the data as a learned vector in a Euclidean space and each relationship between the two concepts as a learned matrix that maps the first concept into an approximation to the second concept. 

1Methods analogous to the ones presented here that can be used to complete any element of a triplet can be found in [4].
Let us assume that our data consists of C such triplets, containing N distinct concepts and M binary relations. We shall call this set of triplets D; V = {v_1, ..., v_N} will denote the set of n-dimensional vectors corresponding to the N concepts, and R = {R_1, ..., R_M} the set of matrices corresponding to the M relations. Often we shall need to indicate the vectors and the matrix which correspond to the concepts and the relation in a certain triplet c. In this case we shall denote the vector corresponding to the first concept with a, the vector corresponding to the second concept with b, and the matrix corresponding to the relation with R. The operation that relates the pair (R, a) to the vector b is matrix-vector multiplication, R * a, which produces an approximation to b.

If for every triplet we think of R * a as a noisy version of one of the concept vectors, then one way to learn an embedding is to maximize the probability that it is a noisy version of the correct completion, b. We imagine that a concept has an average location in the space, but that each "observation" of the concept is a noisy realization of this average location. Assuming spherical Gaussian noise with a variance of 1/2 on each dimension, the probability that R * a is a noisy observation of the concept vector v is proportional to e^{-||R * a - v||^2}, and the discrete distribution over completions implemented by the system is:

    P(v | a, R) = e^{-||R * a - v||^2} / sum_{u in V} e^{-||R * a - u||^2}        (1)

The discriminative goodness function that corresponds to the log probability of getting the right completion, summed over all training triplets, is then:

. 
We shall therefore write the triplet c as (a, R, b), and the goodness as:

    G = sum_{c in D} (1/k_c) log [ e^{-||R * a - b||^2} / sum_{v in V} e^{-||R * a - v||^2} ]        (2)

where k_c is the number of correct completions of the triplet c, i.e. the number of triplets in D sharing its first two elements.2 The sum in the denominator ranges over all the concept vectors in V, which is what makes the goodness discriminative.3 During learning the variance of the noise must not be decreased all the way to 0.4

2We would like our system to assign equal probability to each of the correct completions. The discrete probability distribution that we want to approximate therefore assigns probability 1/k_c to each of the k_c correct completions and 0 to every other concept; the 1/k_c factor in eq. 2 ensures that we are minimizing the Kullback-Leibler divergence between this target distribution and the one implemented by the system, over all the triplets.

3The obvious approach to find an embedding would be to minimize the sum of squared distances between R * a and b. Unfortunately this minimization (almost) always causes all of the vectors and matrices to collapse to the trivial 0 solution.

4For one-to-many relations we must not decrease the variance all the way to 0, because this would cause some concept vectors to become coincident: the only way to make R * a equal to k_c different vectors is by collapsing them onto a unique vector.

. 
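As a concrete sketch (our own illustration, not the authors' code), the completion distribution and the discriminative goodness can be written in a few lines of numpy; the names `V`, `mats` and the toy setup are ours, and we take k_c = 1 for simplicity:

```python
import numpy as np

def goodness(V, mats, triplets):
    """Discriminative goodness G (one correct completion per triplet assumed).

    V: (N, n) array of concept vectors; mats: dict relation -> (n, n) matrix;
    triplets: list of (i, r, j) index triplets standing for (a, R, b)."""
    g = 0.0
    for i, r, j in triplets:
        x = mats[r] @ V[i]                      # R * a, approximation to b
        d2 = ((x - V) ** 2).sum(axis=1)         # squared distance to every concept
        m = (-d2).max()                         # stabilized log-sum-exp
        log_norm = m + np.log(np.exp(-d2 - m).sum())
        g += -d2[j] - log_norm                  # log P(b | a, R)
    return g

def complete(V, mats, i, r):
    """Index of the most probable completion of (a, R, ?): the nearest concept."""
    d2 = ((mats[r] @ V[i] - V) ** 2).sum(axis=1)
    return int(np.argmin(d2))
```

Training would then ascend the gradient of `goodness` with respect to all components of `V` and `mats` (e.g. by conjugate gradient, as the text describes); any autodiff library can supply the gradient.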
During learning this function G (for Goodness) is maximized with respect to all the vector and matrix components. This gives a much better generalization performance than the one obtained by just maximizing the unnormalized probability of the correct completions, i.e. omitting the discriminative denominator. The results presented in the next sections were obtained by maximizing G using gradient ascent. All the vector and matrix components were updated simultaneously at each iteration. One effective method of performing the optimization is conjugate gradient. Learning was fast, usually requiring only a few hundred updates. It is worth pointing out that, in general, different initial configurations and optimization algorithms caused the system to arrive at different solutions, but these solutions were almost always very similar in terms of generalization performance.

2 LRE results

Here we present the results obtained applying LRE to the Family Tree Problem [1]. In this problem, the data consists of people and relations among people belonging to two families, one Italian and one English, shown in fig.1 (left).5 All the information in these trees can be represented in simple propositions of the form (person1, relation, person2). Using the relations father, mother, husband, wife, son, daughter, uncle, aunt, brother, sister, nephew, niece, there are 112 such triplets in the two trees. Fig.1 (right) shows the embedding obtained after training with LRE. Notice how the Italians are linearly separable from the English people. 
From the Hinton diagram, we can see that each member of a family is symmetric to the corresponding member in the other family. The sign of the third component of the vectors is (almost) a feature for the nationality.

[Figure 1 (left) shows the two trees: English family - Christopher = Penelope, Andrew = Christine, Margaret = Arthur, Victoria = James, Jennifer = Charles, with Colin and Charlotte; Italian family - Aurelio = Maria, Bortolo = Emma, Grazia = Pierino, Giannina = Pietro, Doralice = Marcello, with Alberto and Mariemma. The people are numbered 1-24.]

Figure 1: Left: Two isomorphic family trees. The symbol "=" means "married to". Right Top: layout of the vectors representing the people obtained for the Family Tree Problem in 3D. Vector end-points are indicated by *; the ones in the same family tree are connected to each other. All 112 triplets were used for training. Right Bottom: Hinton diagrams of the 3D vectors shown above. The vector of each person is a column, ordered according to the numbering on the tree diagram on the left.

When testing the generalization performance, for each triplet (a, R, ?) in the test set we ordered the concepts according to their probability given R * a and chose the most probable one as the completion. The system was generally able to complete correctly all 112 triplets even when 28 of them, picked at random, had been left out during training. 
These results on the Family Tree Problem are much better than the ones obtained using any other method on the same problem: Quinlan's FOIL [7] could generalize correctly on far fewer held-out triplets, while Hinton (1986) and O'Reilly (1996) made one or more errors when only 4 test cases were held out during training.

5The names of the Italian family have been altered from those originally used in Hinton (1986) to match those of the family of one of the authors.

For most problems there exist triplets which cannot be completed. This is the case, for example, of (Christopher, father, ?) in the Family Tree Problem. Therefore, here we argue that it is not sufficient to test generalization by merely testing the completion of those complete-able triplets which have not been used for training. The proper test for generalization is to see how the system completes any triplet of the kind (a, R, ?), where a ranges over the concepts and R over the relations. We cannot assume to have knowledge of which triplets admit a completion and which do not. To our knowledge this issue has never been analyzed before (even though FOIL handles this problem correctly). To do this the system needs a way to indicate when a triplet does not admit a completion. Therefore, once the maximization of G is terminated, we build a new probabilistic model around the solution which has been found. This new model is constituted, for each relation, of a mixture of N identical spherical Gaussians, each centered on a concept vector, and a Uniform distribution. The Uniform distribution will take care of the "don't know" answers, and will be competing with all the other Gaussians, each representing a concept vector. 
For each relation the Gaussians have different variances and the Uniform a different height. The parameters of this probabilistic model are, for each relation R, the variance of the Gaussians, sigma_R^2, and the relative density under the Uniform distribution, which we shall write as u_R. These parameters are learned using a validation set, which will be the union of a set of complete-able (positive) triplets P and a set of pairs which cannot be completed (negative), Q; that is, Q contains pairs (a, R) such that the result of applying relation R to a does not belong to V. This is done by maximizing the following discriminative goodness function G' over the validation set:

    G' = sum_{(a,R,b) in P} log [ g_R(R * a, b) / Z_R(R * a) ] + sum_{(a,R) in Q} log [ u_R / Z_R(R * a) ]        (3)

where g_R(x, v) = e^{-||x - v||^2 / (2 sigma_R^2)} and Z_R(x) = sum_{v in V} g_R(x, v) + u_R. G' is maximized with respect to the sigma_R^2 and u_R parameters, while everything else is kept fixed. Having learned these parameters, in order to complete any triplet (a, R, ?) we compute the probability distribution over each of the Gaussians and the Uniform distribution given R * a. The system then chooses a vector, or the "don't know" answer, according to those probabilities, as the completion to the triplet.

We used this method on the Family Tree Problem using train, test and validation sets built in the following way. The test set contained 12 positive triplets chosen at random, but such that there was a triplet per relation. 
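As an illustration of this completion rule (a sketch of our own, using unnormalized Gaussian scores so that all normalizing constants are folded into the relative uniform density u):

```python
import numpy as np

def complete_or_dont_know(x, V, sigma2, u):
    """Mixture-of-Gaussians-plus-Uniform completion for one relation.

    x: the vector R * a; V: (N, n) concept vectors; sigma2: Gaussian variance
    for this relation; u: relative density of the competing Uniform.
    Returns the index of the chosen concept, or -1 for "don't know"."""
    d2 = ((x - V) ** 2).sum(axis=1)
    scores = np.exp(-d2 / (2.0 * sigma2))   # one Gaussian score per concept
    Z = scores.sum() + u                    # the Uniform competes with all of them
    if u / Z > (scores / Z).max():          # Uniform wins: no completion
        return -1
    return int(np.argmax(scores))
```

When x lands far from every concept vector, all Gaussian scores vanish and the Uniform term dominates, producing the "don't know" answer.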
The validation set contained a group of 12 positive and a group of 12 negative triplets, chosen at random and such that each group had a triplet per relation. The train set contained the remaining 88 positive triplets. After learning a distributed representation for the entities in the data by maximizing G over the training set, we learned the parameters of the probabilistic model by maximizing G' over the validation set. The resulting system was able to correctly complete all the 288 possible triplets (a, R, ?). Figure 2 shows the distribution of the probabilities when completing one complete-able and one uncomplete-able triplet in the test set.

LRE seems to scale up well to problems of bigger size. We have used it on a much bigger version of the Family Tree Problem, where the family tree is a branch of the real family tree of one of the authors, containing many more people over several generations. 
Using the same set of relations used in the Family Tree Problem, there is a total of several hundred positive triplets. After learning using a training set containing most of the positive triplets, and a validation set constituted by groups of positive and negative triplets, the system is able to complete correctly almost all the possible triplets.

Figure 2: Distribution of the probabilities assigned to each concept for one complete-able ("Charlotte uncle", left) and one uncomplete-able ("Emma aunt", right) triplet, written above each diagram. The complete-able triplet has two correct completions, but neither of the triplets had been used for training. Black bars from 1 to 24 are the probabilities of the people, ordered according to the numbering in fig.1. The last grey bar on the right is the probability of the "don't know" answer.

When many completions are correct, a high probability is always assigned to each one of them. Only in a few cases is a non-negligible probability assigned to some wrong completions. Almost all the generalization errors are of a specific form: the system appears to believe that "brother/sister of" means "son/daughter of parents of". It fails to model the extra restriction that people cannot be their own brother/sister. 
On the other hand, nothing in the data specifies this restriction.

3 Using LRE to represent recursive data structures

In this section, we shall show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. Here we discuss binary trees, but the same reasoning applies to trees of any valence. The approach is inspired by Pollack's RAAM architecture [6]. A RAAM is an auto-encoder which is trained using backpropagation. Figure 3 shows the architecture of the network for binary trees. The system can be thought of as being composed of two networks. The first one, called the compressor, encodes two fixed-width patterns into a single pattern of the same size. The second one, called the reconstructor, decodes a compressed pattern into facsimiles of its parts, and determines when the parts should be further decoded.

Figure 3: Left: the architecture of a RAAM for binary trees; the layers are fully connected (adapted from [6]). Center: how LRE can be used to learn a representation for binary trees in a RAAM-like fashion, with compressor matrices C1, C2 and reconstructor matrices R1, R2. Right: the binary tree structure of the sentences used in the experiment: a noun phrase (adjective a, noun b) followed by a verb phrase (verb c, noun d).

To encode a tree the network must learn as many auto-associations as the total number of non-terminal nodes in the tree. The codes for the terminal nodes are supplied, and the network learns suitable codes for the other nodes. The decoding procedure must decide whether a decoded vector represents a terminal node or an internal node which should be further decoded. 
This is done by using binary codes for the terminal symbols, and then fixing a threshold which is used for checking for "binary-ness" during decoding.

The RAAM approach can be cast as an LRE problem, in which concepts are trees, sub-trees or leaves, or pairs of trees, sub-trees or leaves, and there exist three relations: C, implementing the compressor, and R1 and R2, which jointly implement the reconstructor (see fig.3). We can then learn a representation for all the trees, and the matrices, by maximizing G in eq.2. This formulation, which we have called Hierarchical LRE (HLRE), solves two problems encountered in RAAMs. First, one does not need to supply codes for the leaves of the trees, since LRE will learn an appropriate distributed representation for them. Secondly, one can also learn from the data when to stop the decoding process. In fact, the problem of recognizing whether a node needs to be further decoded is similar to the problem of recognizing that a certain triplet does not admit a completion, which we solved in the previous section. While before we built an outlier model for the "don't know" answers, now we shall build one for the non-terminal nodes. This can be done by learning appropriate values of sigma_R^2 and u_R for the relations R1 and R2, maximizing G' in eq.3. The set of triplets (a, R, b), where b is not a leaf of the tree, will play the role of the negative set which appears in eq.3.

We have applied this method to the problem of encoding binary trees which correspond to sentences of four words from a small vocabulary. 
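To make the casting concrete, here is one way (our own sketch; the string node names are hypothetical placeholders for learned concept vectors) to enumerate the LRE triplets of a binary tree: every internal node yields one compressor triplet, whose first element is the pair of children, and two reconstructor triplets:

```python
def tree_triplets(tree, name="root"):
    """tree: a leaf string, or a (left, right) pair of sub-trees.

    Returns, for every internal node, the triplets
    ((l, r), 'C', parent), (parent, 'R1', l) and (parent, 'R2', r)."""
    triplets = []
    def walk(node, name):
        if isinstance(node, str):                  # a leaf is its own concept
            return node
        left = walk(node[0], name + ".l")
        right = walk(node[1], name + ".r")
        triplets.append(((left, right), "C", name))  # compressor (ternary relation)
        triplets.append((name, "R1", left))          # reconstructor, left child
        triplets.append((name, "R2", right))         # reconstructor, right child
        return name
    walk(tree, name)
    return triplets
```

For a four-word sentence such as (("pretty", "girl"), ("help", "dog")) this yields nine triplets, three per internal node, which can then be fed to the goodness maximization.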
Sentences had a fixed structure: a noun phrase, constituted of an adjective and a noun, followed by a verb phrase, made of a verb and a noun (see fig.3). Thus each sentence had a fixed grammatical structure, to which we added some extra semantic structure in the following way. Words of each grammatical category were divided into two disjoint sets: nouns into {girl, woman, scientist} and {dog, doctor, lawyer}; adjectives into {pretty, young} and {ugly, old}; verbs into {help, love} and {hurt, annoy}. Our training set was constituted by sentences of the type (adjective-1 noun-1 verb-1 noun) and of the type (adjective-2 noun-2 verb-2 noun), where the suffix indicates the first or second of the two sets for each word type. In this way, sentences of the kind "pretty girl annoy scientist" were not allowed in the training set, and there were many other possible sentences that satisfied the constraints which were implicit in the training set.

We used HLRE to learn a distributed representation, in 7D, for all the nodes in the trees, maximizing G using the sentences in the training set. After having built the outlier model for the non-terminal symbols, given any root or internal node the system would reconstruct its children, and if they were non-terminal symbols it would further decode each of them. The decoding process would always halt, providing the correct reconstruction for all the sentences in the training set. The top row of fig.4 shows the distributed representations found for each word in the vocabulary. Notice how the two subsets of adjectives and verbs are almost symmetric with respect to the origin; the difference between the subsets is less evident for the nouns, due to the fact that while there exists a restriction on which nouns can be used in position b, there is no restriction on the nouns appearing in position d in the training sentences (see fig.3, right). 
We tested how well this system could generalize beyond the training set, using the same procedure used by Pollack to enumerate the set of trees that RAAMs are able to represent [6]: for every pair of patterns for trees, first we encoded them into a pattern for a new higher-level tree, and then we decoded this tree back into the patterns of the two sub-trees. If the norm of the difference between the original and the reconstructed sub-trees was within a fixed tolerance, then the tree could be considered to be well formed. The system shows impressive generalization performance: after training, the four-word sentences it generates are all the well-formed sentences, and only those. It does not generate sentences which are either grammatically wrong, like "dog old girl annoy", or sentences which violate semantic constraints, like "pretty girl annoy scientist". This is striking when compared to the poor generalization performance obtained by the RAAM on similar problems. As recognized by
[Figure 4 panels are titled: "Adjectives", "Verbs", "Nouns" (top row); "C1·C1·girl", "C1·C2·girl", "C2·C1·girl", "C2·C2·girl" (center row); "R1·R1·C1·C1·Adjectives", "R1·R1·C1·C2·Nouns", "R1·R1·C2·C1·Verbs", "R1·R1·C2·C2·Nouns" (bottom row).]

Figure 4: For Hinton diagrams with multiple rows, each row relates to a word, in the following order - Adjectives: 1=pretty; 2=young; 3=ugly; 4=old; Nouns: 1=girl; 2=woman; 3=scientist; 4=dog; 5=doctor; 6=lawyer; Verbs: 1=help; 2=love; 3=hurt; 4=annoy. Black bars separate the first set (higher) from the second set (lower). Top row: The distributed representation of the words in the sentences found after learning. Center row: The different contributions given to the root of the tree by the word "girl" when placed in the different positions in the tree. 

As recognized by
Bottom row: The contribution of each leaf to the reconstruction of a, when adjectives, nouns, verbs and nouns are applied in positions a, b, c and d respectively.

Pollack [6], this was almost certainly due to the fact that for the RAAMs the representation for the leaves was too similar, a problem that the HLRE formulation solves, since it learns their distributed representations.

Let us try to explain why HLRE can generalize so well. The compressor matrix C can be decomposed into two sub-matrices, C1 and C2, such that for any two children l and r of a given node we have C * (l; r) = C1 * l + C2 * r, where ";" denotes the concatenation operator. Therefore we have a pair of matrices, either (C1, R1) or (C2, R2), associated to each link in the graph. Once the system has learned an embedding, finding a distributed representation for a given tree amounts to multiplying the representation of its leaves by all the C matrices found on all the paths from the leaves to the root, and adding the results up. Luckily matrix multiplication is non-commutative, and therefore every sequence of words on the leaves can generate a different representation at the root node. The second row of fig.4 makes this point clear, showing the different contributions given to the root of the tree by the word "girl", depending on its position in the sentence. A tree can be "unrolled" from the root to its leaves by multiplying its distributed representation by the R matrices. We can now analyze how a particular leaf is reconstructed. Leaf a, for example, is reconstructed as:

    R1 * R1 * root = R1 R1 C1 C1 * a + R1 R1 C1 C2 * b + R1 R1 C2 C1 * c + R1 R1 C2 C2 * d

where root = C1 * (C1 * a + C2 * b) + C2 * (C1 * c + C2 * d). The third row of fig.4 shows the contribution of each leaf to the reconstruction of a, when adjectives, nouns, verbs and nouns are placed on leaves a, b, c and d respectively. 
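Since the leaf-reconstruction expansion is a pure linearity identity, it holds for arbitrary matrices, which can be checked numerically (a sketch with random matrices standing in for the learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7                                     # embedding dimension, as in the experiment
C1, C2, R1, R2 = (rng.standard_normal((n, n)) for _ in range(4))
a, b, c, d = rng.standard_normal((4, n))  # stand-ins for the four leaf vectors

# Encode bottom-up: C * (l; r) = C1 l + C2 r at each internal node.
np_node = C1 @ a + C2 @ b                 # noun phrase (adjective, noun)
vp_node = C1 @ c + C2 @ d                 # verb phrase (verb, noun)
root = C1 @ np_node + C2 @ vp_node

# Unroll top-down: leaf a is approximated by R1 * R1 * root, which expands
# into one term per leaf, each weighted by the product of matrices on its path.
recon_a = R1 @ (R1 @ root)
expansion = (R1 @ R1 @ C1 @ C1 @ a + R1 @ R1 @ C1 @ C2 @ b
             + R1 @ R1 @ C2 @ C1 @ c + R1 @ R1 @ C2 @ C2 @ d)
assert np.allclose(recon_a, expansion)    # exact equality by linearity
```

After training, the learned R1, R2 make this expansion a good approximation to the leaf itself; the identity above merely shows how each leaf's contribution is routed through the path matrices.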
We can see that the contributions from the adjectives match very closely their actual distributed representations, while the contributions from the nouns in position b are negligible. This means that any adjective placed on a will tend to be reconstructed correctly, and that its reconstruction is independent of the noun we have in position b. On the other hand, the contributions from the verbs and nouns in positions c and d are non-negligible, and notice how those given by words belonging to the first subsets are almost symmetric to those given by words in the second subsets. In this way the system is able to enforce the semantic agreement between the words in the sentence. Finally, the reconstruction of a, when adjectives, nouns, verbs and nouns are not placed on leaves a, b, c and d respectively, assigns a very low probability to any word, and thus the system does not generate sentences which are not well formed.

4 Conclusions

Linear Relational Embedding is a new method for learning distributed representations of concepts and relations from data consisting of instances of relations between given concepts. It finds a mapping from the concepts into a feature-space by imposing the constraint that relations in this feature-space are modeled by linear operations. 
LRE shows excellent generalization performance. The results on the Family Tree Problem are far better than those obtained by any previously published method, and results on other problems are similar. Moreover, we have shown elsewhere [4] that, after learning a distributed representation for a set of concepts and relations, LRE can easily modify these representations to incorporate new concepts and relations, that it is possible to extract logical rules from the solution, and that LRE can be coupled with FOIL [7]. Learning is fast and LRE rarely converges to solutions with poor generalization. We began by introducing LRE for binary relations, and then saw how these ideas can easily be extended to relations of higher arity by simply concatenating concept vectors and using rectangular matrices for the relations. The compressor relation for binary trees is a ternary relation; for trees of higher valence the compressor relation will have higher arity. We have seen how HLRE can be used to find distributed representations for hierarchical structures, and its generalization performance is much better than the one obtained using RAAMs on similar problems.

It is easy to prove that, when all the relations are binary, given a sufficient number of dimensions, there always exists an LRE-type solution that satisfies any set of triplets [4]. However, due to its linearity, LRE cannot represent some relations of arity greater than 2. This limitation can be overcome by adding an extra layer of non-linear units for representing the relations. This new method, called Non-Linear Relational Embedding (NLRE) [4], can represent any relation and has given good generalization results.

References

[1] Geoffrey E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12. Erlbaum, NJ, 1986.

[2] Geoffrey E. Hinton, James L. McClelland, and David E. Rumelhart. 
Distributed representations. In David E. Rumelhart, James L. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing, volume 1, pages 77-109. The MIT Press, 1986.

[3] Randall C. O'Reilly. The LEABRA model of neural interactions and learning in the neocortex. PhD thesis, Department of Psychology, Carnegie Mellon University, 1996.

[4] Alberto Paccanaro. Learning Distributed Representations of Relational Data using Linear Relational Embedding. PhD thesis, Computer Science Department, University of Toronto, 2002.

[5] Alberto Paccanaro and Geoffrey E. Hinton. Learning distributed representations by mapping concepts and relations into a linear space. In Pat Langley, editor, Proceedings of ICML2000, pages 711-718. Morgan Kaufmann, Stanford University, 2000.

[6] Jordan B. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77-105, 1990.

[7] J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239-266, 1990.
", "award": [], "sourceid": 2068, "authors": [{"given_name": "Alberto", "family_name": "Paccanaro", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}