{"title": "Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 7211, "page_last": 7221, "abstract": "Machine understanding of complex images is a key goal of artificial intelligence. One challenge underlying this task is that visual scenes contain multiple inter-related objects, and that global context plays an important role in interpreting the scene. A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels, while modeling within-label interactions. However, it is unclear what principles should guide the design of a structured prediction model that utilizes the power of deep learning components. Here we propose a design principle for such architectures that follows from a natural requirement of permutation invariance. We prove a necessary and sufficient characterization for architectures that follow this invariance, and discuss its implication on model design. Finally, we show that the resulting model achieves new state of the art results on the Visual Genome scene graph labeling benchmark, outperforming all recent approaches.", "full_text": "Mapping Images to Scene Graphs with\n\nPermutation-Invariant Structured Prediction\n\nRoei Herzig\u2217\n\nTel Aviv University\n\nroeiherzig@mail.tau.ac.il\n\nGal Chechik\n\nBar-Ilan University, NVIDIA Research\n\ngal.chechik@biu.ac.il\n\nMoshiko Raboh\u2217\nTel Aviv University\n\nmosheraboh@mail.tau.ac.il\n\nJonathan Berant\n\nTel Aviv University, AI2\n\njoberant@cs.tau.ac.il\n\nAmir Globerson\nTel Aviv University\n\ngamir@post.tau.ac.il\n\nAbstract\n\nMachine understanding of complex images is a key goal of arti\ufb01cial intelligence.\nOne challenge underlying this task is that visual scenes contain multiple inter-\nrelated objects, and that global context plays an important role in interpreting\nthe scene. 
A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels while modeling within-label interactions. However, it is unclear what principles should guide the design of a structured prediction model that utilizes the power of deep learning components. Here we propose a design principle for such architectures that follows from a natural requirement of permutation invariance. We prove a necessary and sufficient characterization for architectures that follow this invariance, and discuss its implication on model design. Finally, we show that the resulting model achieves new state-of-the-art results on the Visual Genome scene-graph labeling benchmark, outperforming all recent approaches.

1 Introduction

Understanding the semantics of a complex visual scene is a fundamental problem in machine perception. It often requires recognizing multiple objects in a scene, together with their spatial and functional relations. The set of objects and relations is sometimes represented as a graph, connecting objects (nodes) with their relations (edges), and is known as a scene graph (Figure 1). Scene graphs provide a compact representation of the semantics of an image, and can be useful for semantic-level interpretation and reasoning about a visual scene [11]. Scene-graph prediction is the problem of inferring the joint set of objects and their relations in a visual scene.

Since objects and relations are inter-dependent (e.g., a person and a chair are more likely to be in the relation "sitting on" than "eating"), a scene-graph predictor should capture this dependence in order to improve prediction accuracy. This goal is a special case of a more general problem, namely, inferring multiple inter-dependent labels, which is the research focus of the field of structured prediction. Structured
Structured\nprediction has attracted considerable attention because it applies to many learning problems and poses\n\n\u2217Equal Contribution.\n\n32nd Conference on Neural Information Processing Systems (NIPS 2018), Montr\u00e9al, Canada.\n\n\fentities in the image (nodes, blue circles) like dog and their relations (edges, red circles) like(cid:10)hat, on, dog(cid:11).\n\nFigure 1: An image and its scene graph from the Visual Genome dataset [15]. The scene graph captures the\n\nunique theoretical and algorithmic challenges [e.g., see 2, 7, 28]. It is therefore a natural approach for\npredicting scene graphs from images.\nStructured prediction models typically de\ufb01ne a score function s(x, y) that quanti\ufb01es how well a\nlabel assignment y is compatible with an input x. In the case of understanding complex visual\nscenes, x is an image, and y is a complex label containing the labels of objects detected in an image\nand the labels of their relations. In this setup, the inference task amounts to \ufb01nding the label that\nmaximizes the compatibility score y\u2217 = arg maxy s(x, y). This score-based approach separates a\nscoring component \u2013 implemented by a parametric model, from an optimization component \u2013 aimed\nat \ufb01nding a label that maximizes that score. Unfortunately, for a general scoring function s(\u00b7), the\nspace of possible label assignments grows exponentially with input size. For instance, for scene\ngraphs the set of possible object label assignments is too large even for relatively simple images,\nsince the vocabulary of candidate objects may contain thousands of objects. As a result, inferring the\nlabel assignment that maximizes a scoring function is computationally hard in the general case.\nAn alternative approach to score-based methods is to map an input x to a structured output y with\na \u201cblack box\" neural network, without explicitly de\ufb01ning a score function. 
This raises a natural question: what is the right architecture for such a network? Here we take an axiomatic approach and argue that one important property such networks should satisfy is invariance to a particular type of input permutation. We then prove that this invariance is equivalent to imposing certain structural constraints on the architecture of the network, and describe architectures that satisfy these constraints.

To evaluate our approach, we first demonstrate on a synthetic dataset that respecting permutation invariance is important, because models that violate this invariance need more training data, despite having a comparable model size. Then, we tackle the problem of scene-graph generation. We describe a model that satisfies the permutation invariance property, and show that it achieves state-of-the-art results on the competitive Visual Genome benchmark [15], demonstrating the power of our new design principle.

In summary, the novel contributions of this paper are: a) deriving sufficient and necessary conditions for graph-permutation invariance in deep structured prediction architectures; b) empirically demonstrating the benefit of graph-permutation invariance; c) developing a state-of-the-art model for scene-graph prediction on a large dataset of complex visual scenes.

2 Structured Prediction

Score-based methods in structured prediction define a function s(x, y) that quantifies the degree to which y is compatible with x, and infer a label by maximizing s(x, y) [e.g., see 2, 7, 16, 20, 28]. Most score functions previously used decompose as a sum over simpler functions, s(x, y) = Σ_i f_i(x, y), making it possible to optimize max_y f_i(x, y) efficiently. This local maximization forms the basic building block of algorithms for approximately maximizing s(x, y).
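To make the decomposition concrete, here is a minimal sketch of a pairwise-decomposed score and its brute-force maximization; the score tables below are hypothetical toy values, not learned scores, and the exhaustive enumeration is exactly what becomes infeasible as the number of variables grows.

```python
import itertools

# Toy pairwise-decomposed score: s(x, y) = sum_i f_i(y_i) + sum_(i,j) f_ij(y_i, y_j).
# f_single and f_pair are hypothetical stand-ins for learned local scores.
def score(y, f_single, f_pair):
    s = sum(f_single[i][yi] for i, yi in enumerate(y))
    s += sum(f_pair[(i, j)][y[i]][y[j]] for (i, j) in f_pair)
    return s

def brute_force_argmax(n_vars, n_labels, f_single, f_pair):
    # Enumerates all n_labels ** n_vars assignments: feasible only for tiny problems,
    # which is why general score maximization is computationally hard.
    return max(itertools.product(range(n_labels), repeat=n_vars),
               key=lambda y: score(y, f_single, f_pair))

# 3 variables, 2 labels each: label 1 is locally preferred, but the edge (0, 1)
# strongly penalizes both variables taking label 1 together.
f_single = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
f_pair = {(0, 1): [[0.0, 0.0], [0.0, -5.0]],
          (1, 2): [[0.0, 0.0], [0.0, 0.5]]}
y_star = brute_force_argmax(3, 2, f_single, f_pair)  # -> (0, 1, 1)
```

The pairwise term overrides the local preference of variable 0, illustrating how within-label interactions change the jointly optimal assignment.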
One way to decompose the score function is to restrict each f_i(x, y) to depend only on a small subset of the y variables.

The renewed interest in deep learning led to efforts to integrate deep networks with structured prediction, including modeling the f_i functions as deep networks. In this context, the most widely used score functions are singleton f_i(y_i, x) and pairwise f_ij(y_i, y_j, x). The early work taking this approach used a two-stage architecture, learning the local scores independently of the structured prediction goal [6, 8]. Later studies considered end-to-end architectures where the inference algorithm is part of the computation graph [7, 23, 26, 33]. Recent studies go beyond pairwise scores, also modeling global factors [2, 10].

Figure 2: Left: graph permutation invariance. A graph labeling function F is graph-permutation invariant (GPI) if permuting the node features maintains the output. Right: a schematic representation of the GPI architecture in Theorem 1. Singleton features z_i are omitted for simplicity. (a) First, the features z_{i,j} are processed element-wise by φ. (b) Features are summed to create a vector s_i, which is concatenated with z_i. (c) A representation of the entire graph is created by applying α n times and summing the created vectors. (d) The graph representation is then finally processed by ρ together with z_k.

Score-based methods provide several advantages. First, they allow intuitive specification of local dependencies between labels and how these translate to global dependencies. Second, for linear score functions, the learning problem has natural convex surrogates [16, 28]. Third, inference in large label spaces is sometimes possible via exact algorithms or empirically accurate approximations. However, with the advent of deep scoring functions s(x, y; w), learning is no longer convex.
Thus, it is worthwhile to rethink the architecture of structured prediction models, and consider models that map inputs x to outputs y directly without explicitly maximizing a score function. We would like these models to enjoy the expressivity and predictive power of neural networks, while maintaining the ability to specify local dependencies between labels in a flexible manner. In the next section, we present such an approach and consider a natural question: what should be the properties of a deep neural network used for structured prediction?

3 Permutation-Invariant Structured Prediction

In what follows we define the permutation-invariance property for structured prediction models, and argue that permutation invariance is a natural principle for designing their architecture.

We first introduce our notation. We focus on structures with pairwise interactions, because they are simpler in terms of notation and are sufficient for describing the structure in many problems. We denote a structured label by y = [y_1, ..., y_n]. In a score-based approach, the score is defined via a set of singleton scores f_i(y_i, x) and pairwise scores f_ij(y_i, y_j, x), where the overall score s(x, y) is the sum of these scores. For brevity, we denote f_ij = f_ij(y_i, y_j, x) and f_i = f_i(y_i, x). An inference algorithm takes as input the local scores f_i, f_ij and outputs an assignment that maximizes s(x, y). We can thus view inference as a black-box that takes node-dependent and edge-dependent inputs (i.e., the scores f_i, f_ij) and returns a label y, even without an explicit score function s(x, y). While numerous inference algorithms exist for this setup, including belief propagation (BP) and mean field, here we develop a framework for a deep labeling algorithm (we avoid the term "inference" since the algorithm does not explicitly maximize a score function).
Such an algorithm will be a black-box, taking the f functions as input and producing the labels y_1, ..., y_n as output. We next ask what architecture such an algorithm should have.

We follow with several definitions. A graph labeling function F : (V, E) → Y is a function whose input is an ordered set of node features V = [z_1, ..., z_n] and an ordered set of edge features E = [z_{1,2}, ..., z_{i,j}, ..., z_{n,n-1}]. For example, z_i can be the array of values f_i, and z_{i,j} can be the table of values f_{i,j}. Assume z_i ∈ R^d and z_{i,j} ∈ R^e. The output of F is a set of node labels y = [y_1, ..., y_n]. Thus, algorithms such as BP are graph labeling functions. However, graph labeling functions do not necessarily maximize a score function. We denote the joint set of node features and edge features by z (i.e., a set of n + n(n-1) = n^2 vectors). In Section 3.1 we discuss extensions to the case where only a subset of the edges is available.

A natural requirement is that the function F produces the same result when given the same features, up to a permutation of the input. For example, consider a label space with three variables y_1, y_2, y_3, and assume that F takes as input z = (z_1, z_2, z_3, z_{1,2}, z_{1,3}, z_{2,3}) = (f_1, f_2, f_3, f_12, f_13, f_23), and outputs a label y = (y*_1, y*_2, y*_3). When F is given an input that is permuted in a consistent way, say, z' = (f_2, f_1, f_3, f_21, f_23, f_13), this defines exactly the same input. Hence, the output should still be y = (y*_2, y*_1, y*_3). Most inference algorithms, including BP and mean field, satisfy this symmetry requirement by design, but this property is not guaranteed in general in a deep model. Here, our goal is to design a deep learning black-box, and hence we wish to guarantee invariance to input permutations.
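The consistent permutation in the example above can be spelled out in a few lines; this is a sketch in which the feature dictionaries are placeholder strings standing in for the f tables:

```python
# Node features z_i and edge features z_{i,j} for the 3-variable example.
z_node = {1: "f1", 2: "f2", 3: "f3"}
z_edge = {(1, 2): "f12", (1, 3): "f13", (2, 3): "f23",
          (2, 1): "f21", (3, 1): "f31", (3, 2): "f32"}

def permute(z_node, z_edge, sigma):
    # Consistent permutation of the input: node i receives z_{sigma(i)},
    # and edge (i, j) receives z_{sigma(i), sigma(j)} -- edges must be
    # re-indexed consistently with the nodes.
    new_node = {i: z_node[sigma[i]] for i in z_node}
    new_edge = {(i, j): z_edge[(sigma[i], sigma[j])] for (i, j) in z_edge}
    return new_node, new_edge

sigma = {1: 2, 2: 1, 3: 3}  # swap nodes 1 and 2
p_node, p_edge = permute(z_node, z_edge, sigma)
# p_node is {1: "f2", 2: "f1", 3: "f3"}, and p_edge[(1, 3)] is "f23",
# matching the z' = (f2, f1, f3, f21, f23, f13) example in the text.
```

Note that an arbitrary reshuffling of the edge dictionary alone would *not* describe the same input; only permutations that act jointly on node and edge indices do.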
A black-box that violates this invariance "wastes" capacity on learning it at training time, which increases sample complexity, as shown in Section 5.1. We proceed to formally define the permutation-invariance property.

Definition 1. Let z be a set of node features and edge features, and let σ be a permutation of {1, ..., n}. We define σ(z) to be a new set of node and edge features given by [σ(z)]_i = z_{σ(i)} and [σ(z)]_{i,j} = z_{σ(i),σ(j)}.

We also use the notation σ([y_1, ..., y_n]) = [y_{σ(1)}, ..., y_{σ(n)}] for permuting the labels. Namely, σ applied to a set of labels yields the same labels, only permuted by σ. Be aware that applying σ to the input features is different from permuting labels, because edge input features must be permuted in a way that is consistent with permuting node input features. We now provide our key definition of a function whose output is invariant to permutations of the input. See Figure 2 (left).

Definition 2. A graph labeling function F is said to be graph-permutation invariant (GPI) if for all permutations σ of {1, ..., n} and for all z it satisfies: F(σ(z)) = σ(F(z)).

3.1 Characterizing Permutation Invariance

Motivated by the above discussion, we ask: what structure is necessary and sufficient to guarantee that F is GPI? Note that a function F takes as input an ordered set z. Therefore its output on z could certainly differ from its output on σ(z). To achieve permutation invariance, F should contain certain symmetries. For instance, one permutation-invariant architecture could be to define y_i = g(z_i) for any function g, but this architecture is too restrictive and does not cover all permutation-invariant functions. Theorem 1 below provides a complete characterization (see Figure 2 for the corresponding architecture).
Intuitively, the architecture in Theorem 1 is such that it can aggregate information from the entire graph, and do so in a permutation-invariant manner.

Theorem 1. Let F be a graph labeling function. Then F is graph-permutation invariant if and only if there exist functions α, ρ, φ such that for all k = 1, ..., n:

$$[F(z)]_k = \rho\Big(z_k,\ \sum_{i=1}^{n} \alpha\big(z_i,\ \sum_{j \neq i} \phi(z_i, z_{i,j}, z_j)\big)\Big), \qquad (1)$$

where φ : R^{2d+e} → R^L, α : R^{d+L} → R^W and ρ : R^{W+d} → R.

Proof. First, we show that any F satisfying the conditions of Theorem 1 is GPI. Namely, for any permutation σ, [F(σ(z))]_k = [F(z)]_{σ(k)}. To see this, write [F(σ(z))]_k using Eq. 1 and Definition 1:

$$[F(\sigma(z))]_k = \rho\Big(z_{\sigma(k)},\ \sum_{i} \alpha\big(z_{\sigma(i)},\ \sum_{j \neq i} \phi(z_{\sigma(i)}, z_{\sigma(i),\sigma(j)}, z_{\sigma(j)})\big)\Big). \qquad (2)$$

The second argument of ρ above is invariant under σ, because it is a sum over nodes and their neighbors, which is invariant under permutation. Thus Eq. 2 is equal to:

$$\rho\Big(z_{\sigma(k)},\ \sum_{i} \alpha\big(z_i,\ \sum_{j \neq i} \phi(z_i, z_{i,j}, z_j)\big)\Big) = [F(z)]_{\sigma(k)},$$

where the equality follows from Eq. 1. We thus proved that Eq. 1 implies graph-permutation invariance.

Next, we prove that any given GPI function F_0 can be expressed as a function F of the form in Eq. 1. Namely, we show how to define φ, α and ρ that can implement F_0. Note that in this direction of the proof the function F_0 is a black-box. Namely, we only know that it is GPI, but do not assume anything else about its implementation.

The key idea is to construct φ, α such that the second argument of ρ in Eq. 1 contains the information about all the graph features z.
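Before continuing, the first direction of the proof can be checked numerically: any architecture of the form of Eq. 1 satisfies Definition 2. The sketch below uses arbitrary untrained toy choices for φ, α, ρ with scalar features (d = e = 1), not the paper's networks:

```python
import numpy as np

def gpi_layer(z_node, z_edge, phi, alpha, rho):
    # Eq. 1: [F(z)]_k = rho(z_k, sum_i alpha(z_i, sum_{j != i} phi(z_i, z_ij, z_j))).
    n = len(z_node)
    s = [sum(phi(z_node[i], z_edge[i][j], z_node[j]) for j in range(n) if j != i)
         for i in range(n)]
    g = sum(alpha(z_node[i], s[i]) for i in range(n))  # graph-level summary
    return np.array([rho(z_node[k], g) for k in range(n)])

# Arbitrary nonlinear stand-ins for phi, alpha, rho (hypothetical, untrained).
phi = lambda zi, zij, zj: np.tanh(zi + 2.0 * zij + 3.0 * zj)
alpha = lambda zi, si: np.tanh(zi - si)
rho = lambda zk, g: float(zk + np.sin(g))

rng = np.random.default_rng(0)
n = 5
z_node = rng.normal(size=n)
z_edge = rng.normal(size=(n, n))

y = gpi_layer(z_node, z_edge, phi, alpha, rho)

# Permute the input consistently (Definition 1) and check Definition 2:
sigma = rng.permutation(n)
y_perm = gpi_layer(z_node[sigma], z_edge[np.ix_(sigma, sigma)], phi, alpha, rho)
assert np.allclose(y_perm, y[sigma])  # F(sigma(z)) == sigma(F(z))
```

The inner and outer sums make the graph summary g independent of node ordering, so the check passes for any choice of φ, α, ρ.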
Then, the function ρ corresponds to an application of F_0 to this representation, followed by extracting the label y_k. To simplify notation, assume edge features are scalar (e = 1). The extension to vectors is simple, but involves more indexing.

We assume WLOG that the black-box function F_0 is a function only of the pairwise features z_{i,j} (otherwise, we can always augment the pairwise features with the singleton features). Since z_{i,j} ∈ R, we use a matrix in R^{n,n} to denote all the pairwise features.

Finally, we assume that our implementation of F_0 will take additional node features z_k such that no two nodes have the same feature (i.e., the features identify the node).

Our goal is thus to show that there exist functions α, φ, ρ such that the function in Eq. 2 applied to Z yields the same labels as F_0(Z).

Let H be a hash function with L buckets mapping node features z_i to an index (bucket). Assume that H is perfect (this can be achieved for a large enough L). Define φ to map the pairwise features to a vector of size L. Let 1[j] be a one-hot vector of dimension R^L, with one in the j-th coordinate. Recall that we consider scalar z_{i,j}, so that φ is indeed in R^L, and define φ as: φ(z_i, z_{i,j}, z_j) = 1[H(z_j)] z_{i,j}, i.e., φ "stores" z_{i,j} in the unique bucket for node j.

Let s_i = Σ_{z_{i,j}∈E} φ(z_i, z_{i,j}, z_j) be the second argument of α in Eq. 1 (s_i ∈ R^L). Then, since all z_j are distinct, s_i stores all the pairwise features for neighbors of i in unique positions within its L coordinates. Since s_i(H(z_k)) contains the feature z_{i,k} whereas s_j(H(z_k)) contains the feature z_{j,k}, we cannot simply sum the s_i, since we would lose the information of which edges the features originated from. Instead, we define α to map s_i to R^{L×L} such that each feature is mapped to a distinct location. Formally:

$$\alpha(z_i, s_i) = 1[H(z_i)]\, s_i^T. \qquad (3)$$

α outputs a matrix that is all zeros except for the features corresponding to node i, which are stored in row H(z_i). The matrix M = Σ_i α(z_i, s_i) (namely, the second argument of ρ in Eq. 1) is a matrix with all the edge features in the graph, including the graph structure.

To complete the construction, we set ρ to have the same outcome as F_0. We first discard rows and columns in M that do not correspond to original nodes (reducing M to dimension n × n). Then, we use the reduced matrix as the input z to the black-box F_0. Assume for simplicity that M does not need to be contracted (this merely introduces another indexing step). Then M corresponds to the original matrix Z of pairwise features, with both rows and columns permuted according to H. We will thus use M as input to the function F_0. Since F_0 is GPI, this means that the label for node k will be given by F_0(M) in position H(z_k). Thus we set ρ(z_k, M) = [F_0(M)]_{H(z_k)}, and by the argument above this equals [F_0(Z)]_k, implying that the above α, φ and ρ indeed implement F_0.

Extension to general graphs. So far, we discussed complete graphs, where edges correspond to valid feature pairs. However, many graphs of interest might be incomplete. For example, an n-variable chain graph in sequence labeling has only n − 1 edges. For such graphs, the input to F would not contain all z_{i,j} pairs but rather only features corresponding to valid edges of the graph, and we are only interested in invariances that preserve the graph structure, namely, the automorphisms of the graph. Thus, the desired invariance is that σ(F(z)) = F(σ(z)), where σ is not an arbitrary permutation but an automorphism. It is easy to see that a simple variant of Theorem 1 holds in this case. All we need to do is replace the sum Σ_{j≠i} in Eq. 2 with Σ_{j∈N(i)}, where N(i) are the neighbors of node i in the graph.
The arguments are then similar to the proof above.

Implications of Theorem 1. Our result has interesting implications for deep structured prediction. First, it highlights the fact that the architecture "collects" information from all different edges of the graph, in an invariant fashion, via the α, φ functions. Specifically, the functions φ (after summation) aggregate all the features around a given node, and then α (after summation) can collect them. Thus, these functions can provide a summary of the entire graph that is sufficient for downstream algorithms. This is different from one round of message-passing algorithms, which would not be sufficient for collecting global graph information. Note that the dimensions of φ, α may need to be large to aggregate all graph information (e.g., by hashing all the features as in the proof of Theorem 1), but the architecture itself can be shallow.

Second, the architecture is parallelizable, as all φ functions can be applied simultaneously. This is in contrast to recurrent models [32], which are harder to parallelize and are thus slower in practice.

Finally, the theorem suggests several common architectural structures that can be used within GPI. We briefly mention two of these. 1) Attention: attention is a powerful component in deep learning architectures [1], but most inference algorithms do not use attention. Intuitively, in attention each node i aggregates features of neighbors through a weighted sum, where the weight is a function of the neighbor's relevance. For example, the label of an entity in an image may depend more strongly on entities that are spatially closer. Attention can be naturally implemented in our GPI characterization, and we provide a full derivation for this implementation in the appendix. It plays a key role in our scene graph model described below.
2) RNNs: because GPI functions are closed under composition, for any GPI function F we can run F iteratively by providing the output of one step of F as part of the input to the next step, and maintain GPI. This results in a recurrent architecture, which we use in our scene graph model.

4 Related Work

The concept of architectural invariance was recently proposed in DEEPSETS [31]. The invariance we consider is much less restrictive: the architecture does not need to be invariant to all permutations of singleton and pairwise features, just those consistent with a graph re-labeling. This characterization results in a substantially different set of possible architectures.

Deep structured prediction. There has been significant recent interest in extending deep learning to structured prediction tasks. Much of this work has been on semantic segmentation, where convolutional networks [27] became a standard approach for obtaining "singleton scores" and various approaches were proposed for adding structure on top. Most of these approaches used variants of message-passing algorithms, unrolled into a computation graph [29]. Some studies parameterized parts of the message-passing algorithm and learned its parameters [18]. Recently, gradient descent has also been used for maximizing score functions [2, 10]. An alternative to deep structured prediction is greedy decoding, inferring one label at a time based on previous labels. This approach has been popular in sequence-based applications (e.g., parsing [5]), relying on the sequential structure of the input, where BiLSTMs are effectively applied. Another related line of work is applying deep learning to graph-based problems, such as TSP [3, 9, 13]. Clearly, the notion of graph invariance is important in these, as highlighted in [9].
They, however, do not specify a general architecture that satisfies invariance as we do here, and in fact focus on message-passing architectures, which we strictly generalize. Furthermore, our focus is on the more general problem of structured prediction, rather than specific graph-based optimization problems.

Scene graph prediction. Extracting scene graphs from images provides a semantic representation that can later be used for reasoning, question answering, and image retrieval [12, 19, 25]. It is at the forefront of machine vision research, integrating challenges like object detection, action recognition and detection of human-object interactions [17, 24]. Prior work on scene graph prediction used neural message-passing algorithms [29] as well as prior knowledge in the form of word embeddings [19]. Other work suggested predicting graphs directly from pixels in an end-to-end manner [21]. NeuralMotif [32], currently the state-of-the-art model for scene graph prediction on Visual Genome, employs an RNN that provides global context by sequentially reading the independent predictions for each entity and relation and then refines those predictions. The NEURALMOTIF model maintains GPI by fixing the order in which the RNN reads its inputs, and thus only a single order is allowed. However, this fixed order is not guaranteed to be optimal.

5 Experimental Evaluation

We empirically evaluate the benefit of GPI architectures, first using a synthetic graph-labeling task, and then for the problem of mapping images to scene graphs.

5.1 Synthetic Graph Labeling

We start by studying GPI on a synthetic problem, defined as follows. An input graph G = (V, E) is given, where each node i ∈ V is assigned to one of K sets. The set for node i is denoted by Γ(i). The goal is to compute for each node the number of neighbors that belong to the same set. Namely, the label of a node is y_i = Σ_{j∈N(i)} 1[Γ(i) = Γ(j)]. We generated random graphs with 10 nodes (larger graphs produced similar results) by sampling each edge independently and uniformly, and sampling Γ(i) for every node uniformly from {1, ..., K}. The node features z_i ∈ {0, 1}^K are one-hot vectors of Γ(i), and the edge features z_{i,j} ∈ {0, 1} indicate whether ij ∈ E. We compare two standard non-GPI architectures and one GPI architecture: (a) a GPI architecture for graph prediction, described in detail in Section 5.2; we used the basic version without attention and RNN. (b) LSTM: we replace Σφ(·) and Σα(·), which perform the aggregation in Theorem 1, with two LSTMs with a state size of 200 that read their input in random order. (c) A fully connected (FC) feed-forward network with 2 hidden layers of 1000 nodes each. The input to the fully connected model is a concatenation of all node and pairwise features; the output is all node predictions. The focus of the experiment is to study sample complexity. Therefore, for a fair comparison, we use the same number of parameters for all models.

Figure 3: Accuracy as a function of sample size for graph labeling. Right is a zoomed-in version of the left.

Figure 3 shows the results, demonstrating that GPI requires far fewer samples to converge to the correct solution. This illustrates the advantage of an architecture with the correct inductive bias for the problem.

5.2 Scene-Graph Classification

We evaluate the GPI approach on the motivating task of this paper, inferring scene graphs from images (Figure 1).
In this problem, the input is an image annotated with a set of bounding boxes for the entities in the image.² The goal is to label each bounding box with the correct entity category and every pair of entities with their relation, such that they form a coherent scene graph.

We begin by describing our Scene Graph Predictor (SGP) model. We aim to predict two types of variables. The first is entity variables [y_1, ..., y_n] for all bounding boxes. Each y_i can take one of L values (e.g., "dog", "man"). The second is relation variables [y_{n+1}, ..., y_{n^2}] for every pair of bounding boxes. Each such y_j can take one of R values (e.g., "on", "near"). Our graph connects variables that are expected to be inter-related. It contains two types of edges: 1) entity-entity edges, connecting every two entity variables (y_i and y_j for 1 ≤ i ≠ j ≤ n); 2) entity-relation edges, connecting every relation variable y_k (where k > n) to its two entity variables. Thus, our graph is not a complete graph, and our goal is to design an architecture that will be invariant to any automorphism of the graph, such as permutations of the entity variables.

For the input features z, we used the features learned by the baseline model from [32].³ Specifically, the entity features z_i included: (1) the confidence probabilities of all entities for y_i as learned by the baseline model; (2) bounding box information given as (left, bottom, width, height); (3) the number of smaller entities (also bigger); (4) the number of entities to the left, right, above and below; (5) the number of entities with higher and with lower confidence; (6) for the linguistic model only: the word embedding of the most probable class. Word vectors were learned with GloVe from the ground-truth captions of Visual Genome.

Similarly, the relation features z_j ∈ R^R contained the probabilities of relation entities for the relation j.
For the Linguistic model, these features were extended to include the word embedding of the most probable class. For entity-entity pairwise features z_{i,j}, we use the relation probability for each pair.

² For simplicity, we focus on the task where boxes are given.
³ The baseline does not use any LSTM or context, and is thus unrelated to the main contribution of [32].

                          Constrained Evaluation             Unconstrained Evaluation
                          SGCls           PredCls            SGCls           PredCls
                          R@50   R@100    R@50   R@100       R@50   R@100    R@50   R@100
Lu et al., 2016 [19]      11.8   14.1     27.9   35.0        -      -        -      -
Xu et al., 2017 [29]      21.7   24.4     44.8   53.0        -      -        -      -
Pixel2Graph [21]          -      -        -      -           26.5   30.0     68.0   75.2
Graph R-CNN [30]          29.6   31.6     54.2   59.1        -      -        -      -
Neural Motifs [32]        35.8   36.5     65.2   67.1        44.5   47.7     81.1   88.3
Baseline [32]             34.6   35.3     63.7   65.6        43.4   46.6     78.8   85.9
No Attention              35.3   37.2     64.5   66.3        44.1   48.5     79.7   86.7
Neighbor Attention        35.7   38.5     64.6   66.6        44.7   49.9     80.0   87.1
Linguistic                36.5   38.8     65.1   66.9        45.5   50.8     80.8   88.2

Table 1: Test-set results for graph-constrained evaluation (i.e., the returned triplets must be consistent with a scene graph) and for unconstrained evaluation (triplets need not be consistent with a scene graph).

Because the outputs of SGP are probability distributions over entities and relations, we use them as the input z to SGP, once again in a recurrent manner, and maintain GPI.

We next describe the main components of the GPI architecture. First, we focus on the parts that output the entity labels. φ_ent is the network that integrates features for two entity variables y_i and y_j. It simply takes z_i, z_j and z_{i,j} as input, and outputs a vector of dimension n_1. Next, the network α_ent takes as input the outputs of φ_ent for all neighbors of an entity, and uses the attention mechanism described above to output a vector of dimension n_2.
Finally, the ρent network takes these
n2-dimensional vectors and outputs L logits predicting the entity value. The ρrel network takes as
input the αent representations of the two entities, as well as zi,j, and outputs R
logits. See the appendix for specific network architectures.

5.2.1 Experimental Setup and Results
Dataset. We evaluated our approach on Visual Genome (VG) [15], a dataset with 108,077 images
annotated with bounding boxes, entities and relations. Images contain 12 entities and 7
relations on average. For a proper comparison with previous results [21, 29, 32], we used the data
from [29], including the train and test splits. For evaluation, we used the same 150 entities and 50
relations as in [21, 29, 32]. To tune hyper-parameters, we also split the training data into two by
randomly selecting 5K examples, resulting in a final 70K/5K/32K split for train/validation/test sets.
Training. All networks were trained using Adam [14] with batch size 20. Hyperparameter values
below were chosen based on the validation set. The SGP loss function was the sum of cross-entropy
losses over all entities and relations in the image. In the loss, we penalized entities 4 times more
strongly than relations, and penalized negative relations 10 times more weakly than positive relations.
Evaluation. In [29], three different evaluation settings were considered. Here we focus on two of
these: (1) SGCls: Given ground-truth bounding boxes for entities, predict all entity categories and
relation categories. (2) PredCls: Given bounding boxes annotated with entity labels, predict all
relations. Following [19], we used Recall@K as the evaluation metric. It measures the fraction of
correct ground-truth triplets that appear within the K most confident triplets proposed by the model.
Two evaluation protocols are used in the literature, which differ in whether they enforce graph constraints
over model predictions.
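Independently of the protocol, the core Recall@K computation just described can be sketched as follows. This is a simplified version: the function name, the triplet representation, and the empty-ground-truth convention are our assumptions, and benchmark implementations additionally handle duplicate predictions and, in detection settings, typically match boxes by IoU:

```python
def recall_at_k(gt_triplets, pred_triplets, k):
    """Fraction of ground-truth triplets among the model's top-k predictions.

    gt_triplets: set of hashable triplets, e.g. (subj_id, predicate, obj_id).
    pred_triplets: list of (triplet, confidence) pairs.
    """
    if not gt_triplets:
        return 1.0  # convention for images with no annotated triplets (ours)
    # Rank predictions by confidence and keep the k most confident.
    top_k = sorted(pred_triplets, key=lambda t: -t[1])[:k]
    hits = sum(1 for triplet, _ in top_k if triplet in gt_triplets)
    return hits / len(gt_triplets)
```

Reported numbers are averages of this quantity over all test images.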
The first, graph-constrained protocol requires that the top-K triplets assign
one consistent class per entity and relation. The second, unconstrained protocol does not enforce any
such constraints. We report results on both protocols, following [32].

Figure 4: (a) An input image with bounding boxes from VG. (b) The ground-truth scene graph. (c) The
Baseline fails to recognize some entities (tail and tree) and relations (in front of instead of looking at). (d)
GPI:LINGUISTIC fixes most incorrect LP predictions. (e) Window is the most significant neighbor of Tree. (f)
The entity bird receives substantial attention, while Tree and building are less informative.

Models and baselines. We compare three variants of our GPI approach with the reported results
of five baselines that are currently the state of the art on various scene-graph prediction problems
(all models use the same data split and pre-processing as [29]): 1) LU ET AL., 2016 [19]: This
work leverages word embeddings to fine-tune the likelihood of predicted relations. 2) XU ET AL.,
2017 [29]: This model passes messages between entities and relations, and iteratively refines the
feature map used for prediction. 3) NEWELL & DENG, 2017 [21]: The PIXEL2GRAPH model
uses associative embeddings [22] to produce a full graph from the image. 4) YANG ET AL., 2018
[30]: The GRAPH R-CNN model uses object-relation regularities to sparsify and reason over scene
graphs. 5) ZELLERS ET AL., 2017 [32]: The NEURALMOTIF method encodes global context for
capturing high-order motifs in scene graphs, and its BASELINE variant outputs the entity and relation
distributions without using global context. The following variants of GPI were compared: 1) GPI:
NO ATTENTION: Our GPI model, but with no attention mechanism; instead, following Theorem
1, we simply sum the features. 2) GPI: NEIGHBOR ATTENTION: Our GPI model, with attention
over neighbor features.
3) GPI: LINGUISTIC: Same as GPI: NEIGHBOR ATTENTION, but also
concatenating the word embedding vector, as described above.

Results. Table 1 shows Recall@50 and Recall@100 for the three variants of our approach, compared
with five baselines. All GPI variants perform well, with LINGUISTIC outperforming all baselines
for SGCls and being comparable to the state-of-the-art model for PredCls. Note that PredCls is an
easier task, which makes less use of the structure; hence it is not surprising that GPI achieves accuracy
similar to [32]. Figure 4 illustrates the model behavior. Predicting isolated labels with zi (4c)
mislabels several entities, but these are corrected in the final output (4d). Figure 4e shows that the
system learned to attend more to nearby entities (the window and building are closer to the tree), and
4f shows that stronger attention is learned for the class bird, presumably because it is usually more
informative than common classes like tree.

Implementation details. The φ and α networks were each implemented as a single fully-connected
(FC) layer with 500-dimensional outputs. ρ was implemented as an FC network with three 500-
dimensional hidden layers, with one 150-dimensional output for the entity probabilities and one
51-dimensional output for the relation probabilities. The attention mechanism was implemented as a
network similar to φ and α, receiving the same inputs but using its outputs as attention scores. The
full code is available at https://github.com/shikorab/SceneGraph

6 Conclusion
We presented a deep learning approach to structured prediction, which constrains the architecture
to be invariant to structurally identical inputs. As in score-based methods, our approach relies on
pairwise features capable of describing inter-label correlations, thus inheriting the intuitive
appeal of score-based approaches.
However, instead of maximizing a score function (which leads
to computationally hard inference), we directly produce an output that is invariant to equivalent
representations of the pairwise terms.
This axiomatic approach to model architecture can be extended in many ways. For image labeling,
geometric invariances (shift or rotation) may be desired. In other cases, invariance to feature
permutations may be useful. We leave the derivation of the corresponding architectures to future
work. Finally, there may be cases where the invariant structure is unknown and should be discovered
from data, which is related to work on lifting graphical models [4]. It would be interesting to explore
algorithms that discover and use such symmetries for deep structured prediction.

Acknowledgements
This work was supported by the ISF Centers of Excellence grant, and by the Yandex Initiative in
Machine Learning. Work by GC was performed while at Google Brain Research.

References
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align
and translate. In International Conference on Learning Representations (ICLR), 2015.

[2] David Belanger, Bishan Yang, and Andrew McCallum. End-to-end learning for structured
prediction energy networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the
34th International Conference on Machine Learning, volume 70, pages 429-439. PMLR, 2017.

[3] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial
optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

[4] Hung Hai Bui, Tuyen N. Huynh, and Sebastian Riedel. Automorphism groups of graphical
models and lifted variational inference. In Proceedings of the Twenty-Ninth Conference on
Uncertainty in Artificial Intelligence, UAI'13, pages 132-141, 2013.

[5] Danqi Chen and Christopher Manning.
A fast and accurate dependency parser using neural
networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 740-750, 2014.

[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille.
Semantic image segmentation with deep convolutional nets and fully connected CRFs. In
Proceedings of the Second International Conference on Learning Representations, 2014.

[7] Liang-Chieh Chen, Alexander G Schwing, Alan L Yuille, and Raquel Urtasun. Learning deep
structured models. In Proc. ICML, 2015.

[8] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35(8):1915-1929, 2013.

[9] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.

[10] Michael Gygli, Mohammad Norouzi, and Anelia Angelova. Deep value networks learn to
evaluate and iteratively refine structured outputs. In Doina Precup and Yee Whye Teh, editors,
Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings
of Machine Learning Research, pages 1341-1351, International Convention Centre,
Sydney, Australia, 2017. PMLR.

[11] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. arXiv
preprint arXiv:1804.01622, 2018.

[12] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S.
Bernstein, and Fei-Fei Li. Image retrieval using scene graphs. In Proc. Conf. Comput. Vision
Pattern Recognition, pages 3668-3678, 2015.

[13] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial
optimization algorithms over graphs.
In Advances in Neural Information Processing Systems,
pages 6351-6361, 2017.

[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.

[15] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie
Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting
language and vision using crowdsourced dense image annotations. International Journal of
Computer Vision, 123(1):32-73, 2017.

[16] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In Proceedings of the 18th International Conference on
Machine Learning, pages 282-289, 2001.

[17] Wentong Liao, Michael Ying Yang, Hanno Ackermann, and Bodo Rosenhahn. On support
relations and semantic scene graphs. arXiv preprint arXiv:1609.05834, 2016.

[18] Guosheng Lin, Chunhua Shen, Ian Reid, and Anton van den Hengel. Deeply learning the
messages in message passing inference. In Advances in Neural Information Processing Systems,
pages 361-369, 2015.

[19] Cewu Lu, Ranjay Krishna, Michael S. Bernstein, and Fei-Fei Li. Visual relationship detection
with language priors. In European Conf. Comput. Vision, pages 852-869, 2016.

[20] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning efficiently with approximate
inference via dual losses. In Proceedings of the 27th International Conference on Machine
Learning, pages 783-790, New York, NY, USA, 2010. ACM.

[21] Alejandro Newell and Jia Deng. Pixels to graphs by associative embedding. In Advances in
Neural Information Processing Systems 30, pages 1172-1180. Curran Associates, Inc., 2017.

[22] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning
for joint detection and grouping. In Neural Inform. Process.
Syst., pages 2274-2284. Curran
Associates, Inc., 2017.

[23] Wenzhe Pei, Tao Ge, and Baobao Chang. An effective neural network model for graph-based
dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for
Computational Linguistics, pages 313-322, 2015.

[24] Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, and Svetlana
Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language
cues. In ICCV, pages 1946-1955, 2017.

[25] David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, and Peter
Battaglia. Discovering objects and their relations from entangled scene representations. arXiv
preprint arXiv:1702.05068, 2017.

[26] Alexander G Schwing and Raquel Urtasun. Fully connected deep structured networks. ArXiv
e-prints, 2015.

[27] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640-651, 2017.

[28] B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. In S. Thrun, L. Saul, and
B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25-32.
MIT Press, Cambridge, MA, 2004.

[29] Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative
message passing. In Proc. Conf. Comput. Vision Pattern Recognition, pages 3097-3106, 2017.

[30] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph R-CNN for scene
graph generation. In European Conf. Comput. Vision, pages 690-706, 2018.

[31] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov,
and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems 30,
pages 3394-3404. Curran Associates, Inc., 2017.

[32] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi.
Neural motifs: Scene graph
parsing with global context. arXiv preprint arXiv:1711.06640, 2017.

[33] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su,
Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural
networks. In Proceedings of the IEEE International Conference on Computer Vision, pages
1529-1537, 2015.