{"title": "Pixels to Graphs by Associative Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 2171, "page_last": 2180, "abstract": "Graphs are a useful abstraction of image content. Not only can graphs represent details about individual objects in a scene but they can capture the interactions between pairs of objects. We present a method for training a convolutional neural network such that it takes in an input image and produces a full graph definition. This is done end-to-end in a single stage with the use of associative embeddings. The network learns to simultaneously identify all of the elements that make up a graph and piece them together. We benchmark on the Visual Genome dataset, and demonstrate state-of-the-art performance on the challenging task of scene graph generation.", "full_text": "Pixels to Graphs by Associative Embedding\n\nAlejandro Newell\nJia Deng\nComputer Science and Engineering\nUniversity of Michigan, Ann Arbor\n{alnewell, jiadeng}@umich.edu\n\nAbstract\n\nGraphs are a useful abstraction of image content. Not only can graphs represent\ndetails about individual objects in a scene but they can capture the interactions\nbetween pairs of objects. We present a method for training a convolutional neural\nnetwork such that it takes in an input image and produces a full graph de\ufb01nition.\nThis is done end-to-end in a single stage with the use of associative embeddings.\nThe network learns to simultaneously identify all of the elements that make up a\ngraph and piece them together. We benchmark on the Visual Genome dataset, and\ndemonstrate state-of-the-art performance on the challenging task of scene graph\ngeneration.\n\n1\n\nIntroduction\n\nExtracting semantics from images is one of the main goals of computer vision. Recent years have\nseen rapid progress in the classi\ufb01cation and localization of objects [7, 24, 10]. 
But a bag of labeled\nand localized objects is an impoverished representation of image semantics: it tells us what and where\nthe objects are (\u201cperson\u201d and \u201ccar\u201d), but does not tell us about their relations and interactions (\u201cperson\nnext to car\u201d). A necessary step is thus to not only detect objects but to identify the relations between\nthem. An explicit representation of these semantics is referred to as a scene graph [12] where we\nrepresent objects grounded in the scene as vertices and the relationships between them as edges.\nEnd-to-end training of convolutional networks has proven to be a highly effective strategy for image\nunderstanding tasks. It is therefore natural to ask whether the same strategy would be viable for\npredicting graphs from pixels. Existing approaches, however, tend to break the problem down into\nmore manageable steps. For example, one might run an object detection system to propose all of the\nobjects in the scene, then isolate individual pairs of objects to identify the relationships between them\n[18]. This breakdown often restricts the visual features used in later steps and limits reasoning over\nthe full graph and over the full contents of the image.\nWe propose a novel approach to this problem, where we train a network to de\ufb01ne a complete graph\nfrom a raw input image. The proposed supervision allows a network to better account for the full\nimage context while making predictions, meaning that the network reasons jointly over the entire\nscene graph rather than focusing on pairs of objects in isolation. Furthermore, there is no explicit\nreliance on external systems such as Region Proposal Networks (RPN) [24] that provide an initial\npool of object detections.\nTo do this, we treat all graph elements\u2014both vertices and edges\u2014as visual entities to be detected as\nin a standard object detection pipeline. 
Specifically, a vertex is an instance of an object ("person"), and an edge is an instance of an object-object relation ("person next to car"). Just as visual patterns in an image allow us to distinguish between objects, there are properties of the image that allow us to see relationships. We train the network to pick up on these properties and point out where objects and relationships are likely to exist in the image space.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Scene graphs are defined by the objects in an image (vertices) and their interactions (edges). The ability to express information about the connections between objects makes scene graphs a useful representation for many computer vision tasks, including captioning and visual question answering.

What distinguishes this work from established detection approaches [24] is the need to represent connections between detections. Traditionally, a network takes an image, identifies the items of interest, and outputs a pile of independent objects. A given detection does not tell us anything about the others. But now, if the network produces a pool of objects ("car", "person", "dog", "tree", etc.) and also identifies a relationship such as "in front of", we need to define which of the detected objects is in front of which. Since we do not know which objects will be found in a given image ahead of time, the network needs some way to refer to its own outputs.

We draw inspiration from associative embeddings [20] to solve this problem. Originally proposed for detection and grouping in the context of multiperson pose estimation, associative embeddings provide the necessary flexibility in the network's output space.
For pose estimation, the idea is to predict an embedding vector for each detected body joint such that detections with similar embeddings can be grouped to form an individual person. But in its original formulation, the embeddings are too restrictive: the network can only define clusters of nodes, whereas for a scene graph we need to express arbitrary edges between pairs of nodes.

To address this, associative embeddings must be used in a substantially different manner. That is, rather than having nodes output a shared embedding to refer to clusters and groups, we instead have each node define its own unique embedding. Given a set of detected objects, the network outputs a different embedding for each object. Now, each edge can refer to its source and destination nodes by correctly producing their embeddings. Once the network is trained, it is straightforward to match the embeddings from detected edges to each vertex and construct a final graph.

There is one further issue that we address in this work: how to deal with detections grounded at the same location in the image. Frequently in graph prediction, multiple vertices or edges may appear in the same place. Supervision of this is difficult, as training a network traditionally requires telling it exactly what appears and where. With an unordered set of overlapping detections there may not be a direct mapping to explicitly lay this out. Consider a set of object relations grounded at the same pixel location. Assume the network has some fixed output space consisting of discrete "slots" in which detections can appear. It is unclear how to define a mapping so that the network has a consistent rule for organizing its relation predictions into these slots.
We address this problem by not enforcing any explicit mapping at all, and instead provide supervision such that it does not matter how the network chooses to fill its output; a correct loss can still be applied.

Our contributions are a novel use of associative embeddings for connecting the vertices and edges of a graph, and a technique for supervising an unordered set of network outputs. Together these form the building blocks of our system for direct graph prediction from pixels. We apply our method to the task of generating a semantic graph of objects and relations and test on the Visual Genome dataset [14]. We achieve state-of-the-art results, improving performance over prior work by nearly a factor of three on the most difficult task setting.

2 Related Work

Relationship detection: There are many ways to frame the task of identifying objects and the relationships between them. This includes localization from referential expressions [11], detection of human-object interactions [3], or the more general tasks of visual relationship detection (VRD) [18] and scene graph generation [12]. In all of these settings, the aim is to correctly determine the relationships between pairs of objects and ground this in the image with accurate object bounding boxes.

Visual relationship detection has drawn much recent attention [18, 28, 27, 2, 17, 19, 22, 23]. The open-ended and challenging nature of the task lends itself to a variety of diverse approaches and solutions.
For example: incorporating vision and language when reasoning over a pair of objects [18]; using message-passing RNNs to process a set of proposed object boxes [26]; predicting over triplets of bounding boxes that correspond to proposals for a subject, phrase, and object [15]; using reinforcement learning to sequentially evaluate pairs of object proposals and determine their relationships [16]; comparing the visual features and relative spatial positions of pairs of boxes [4]; learning to project proposed objects into a vector space such that the difference between two object vectors is informative of the relationship between them [27].

Most of these approaches rely on generated bounding boxes from a Region Proposal Network (RPN) [24]. Our method does not require proposed boxes and can produce detections directly from the image. However, proposals can be incorporated as additional input to improve performance. Furthermore, many methods process pairs of objects in isolation, whereas we train a network to process the whole image and produce all object and relationship detections at once.

Associative Embedding: Vector embeddings are used in a variety of contexts, for example to measure the similarity between pairs of images [6, 25], or to map visual and text features to a shared vector space [5, 8, 13]. Recent work uses vector embeddings to group together body joints for multiperson pose estimation [20]. These are referred to as associative embeddings since supervision does not require the network to output a particular vector value, and instead uses the distances between pairs of embeddings to calculate a loss. What is important is not the exact value of the vector but how it relates to the other embeddings produced by the network.

More specifically, in [20] a network is trained to detect the body joints of the various people in an image. In addition, it must produce a vector embedding for each of its detections.
The embedding is used to identify which person a particular joint belongs to. This is done by ensuring that all joints that belong to a single individual produce the same output embedding, and that the embeddings across individuals are sufficiently different to separate detections out into discrete groups. In a certain sense, this approach does define a graph, but the graph is restricted in that it can only represent clusters of nodes. For the purposes of our work, we take a different perspective on the associative embedding loss in order to express any arbitrary graph as defined by a set of vertices and directed edges. There are other ways that embeddings could be applied to solve this problem, but our approach depends on our specific formulation, where we treat edges as elements of the image to be detected, which is not obvious given the prior use of associative embeddings for pose.

3 Pixels → Graph

Our goal is to construct a graph from a set of pixels. In particular, we want to construct a graph grounded in the space of these pixels, meaning that in addition to identifying the vertices of the graph, we want to know their precise locations. A vertex in this case can refer to any object of interest in the scene, including people, cars, clothing, and buildings. The relationships between these objects are then captured by the edges of the graph. These relationships may include verbs (eating, riding), spatial relations (on the left of, behind), and comparisons (smaller than, same color as).

More formally, we consider a directed graph G = (V, E). A given vertex v_i ∈ V is grounded at a location (x_i, y_i) and defined by its class and bounding box. Each edge e ∈ E takes the form e_i = (v_s, v_t, r_i), defining a relationship of type r_i from v_s to v_t. We train a network to explicitly define V and E.
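As a concrete illustration, the directed-graph representation just defined (vertices with a class, grounding location, and bounding box; edges as (v_s, v_t, r_i) triples) could be held in a structure like the following sketch. The class and field names here are our own, not from the paper:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Vertex:
    cls: str                         # object class, e.g. "person"
    center: Tuple[int, int]          # grounding location (x_i, y_i)
    box: Tuple[int, int, int, int]   # bounding box (x0, y0, x1, y1)

@dataclass
class Edge:
    source: int   # index of v_s in the vertex list
    target: int   # index of v_t
    rel: str      # relationship type r_i, e.g. "next to"

@dataclass
class SceneGraph:
    vertices: List[Vertex] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

# A toy graph encoding "person next to car".
g = SceneGraph()
g.vertices.append(Vertex("person", (40, 60), (20, 20, 60, 100)))
g.vertices.append(Vertex("car", (120, 70), (80, 40, 160, 100)))
g.edges.append(Edge(0, 1, "next to"))
```

Note that edges are directed: swapping `source` and `target` would describe a different relationship instance.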
This training is done end-to-end on a single network, allowing the network to reason fully over the image and all possible components of the graph when making its predictions.

While production of the graph occurs all at once, it helps to think of the process in two main steps: detecting individual elements of the graph, and connecting these elements together. For the first step, the network indicates where vertices and edges are likely to exist and predicts the properties of these detections. For the second, we determine which two vertices are connected by a detected edge. We describe these two steps in detail in the following subsections.

Figure 2: Full pipeline for object and relationship detection. A network is trained to produce two heatmaps that activate at the predicted locations of objects and relationships. Feature vectors are extracted from the pixel locations of top activations and fed through fully connected networks to predict object and relationship properties. Embeddings produced at this step serve as IDs allowing detections to refer to each other.

3.1 Detecting graph elements

First, the network must find all of the vertices and edges that make up a graph. Each graph element is grounded at a pixel location which the network must identify. In a scene graph where vertices correspond to object detections, the center of the object bounding box will serve as the grounding location. We ground edges at the midpoint of the source and target vertices: (⌊(x_s + x_t)/2⌋, ⌊(y_s + y_t)/2⌋).

With this grounding in mind, we can detect individual elements by using a network that produces per-pixel features at a high output resolution. The feature vector at a pixel determines if an edge or vertex is present at that location, and if so is used to predict the properties of that element.

A convolutional neural network is used to process the image and produce a feature tensor of size h × w × f.
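For instance, the grounding locations described above (box centers for vertices, the floored midpoint of two vertex centers for edges) could be computed as in this small sketch; the function names are ours, and we approximate the box center with integer division:

```python
def vertex_grounding(box):
    """Ground a vertex at the (integer) center of its bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)

def edge_grounding(src_center, tgt_center):
    """Ground an edge at the floored midpoint of its source and target vertices."""
    xs, ys = src_center
    xt, yt = tgt_center
    return ((xs + xt) // 2, (ys + yt) // 2)

person = vertex_grounding((20, 20, 60, 100))   # (40, 60)
car = vertex_grounding((80, 40, 160, 100))     # (120, 70)
mid = edge_grounding(person, car)              # (80, 65)
```

In practice these pixel coordinates would live in the network's output resolution, so distinct elements can land on the same pixel, which is exactly the overlap issue discussed later in Section 3.3.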
All information necessary to define a vertex or edge is thus encoded at a particular pixel in a feature vector of length f. Note that even at a high output resolution, multiple graph elements may be grounded at the same location. The following discussion assumes up to one vertex and one edge can exist at a given pixel, and we elaborate on how we accommodate multiple detections in Section 3.3.

We use a stacked hourglass network [21] to process an image and produce the output feature tensor. While our method has no strict dependence on network architecture, there are some properties that are important for this task. The hourglass design combines global and local information to reason over the full image and produce high quality per-pixel predictions. This was originally done for human pose prediction, which requires global reasoning over the structure of the body, but also precise localization of individual joints. Similar logic applies to scene graphs, where the context of the whole scene must be taken into account, but we wish to preserve the local information of individual elements.

An important design choice here is the output resolution of the network. It does not have to match the full input resolution, but there are a few details worth considering. First, it is possible for elements to be grounded at the exact same pixel. The lower the output resolution, the higher the probability of overlapping detections. Our approach allows this, but the fewer overlapping detections, the better. All information necessary to define these elements must be encoded into a single feature vector of length f, which gets more difficult as more elements occupy a given location.
Another detail is that a higher output resolution improves localization.

To predict the presence of graph elements, we take the final feature tensor and apply a 1x1 convolution and sigmoid activation to produce two heatmaps (one for vertices and another for edges). Each heatmap indicates the likelihood that a vertex or edge exists at a given pixel. Supervision is a binary cross-entropy loss on the heatmap activations, and we threshold on the result to produce a candidate set of detections.

Next, for each of these detections we must predict properties such as the class label. We extract the feature vector from the corresponding location of a detection, and use the vector as input to a set of fully connected networks. A separate network is used for each property we wish to predict, and each consists of a single hidden layer with f nodes. This is illustrated above in Figure 2. During training we use the ground truth locations of vertices and edges to extract features. A softmax loss is used to supervise labels like object class and relationship predicate. To predict bounding box information, we use anchor boxes and regress offsets based on the approach in Faster-RCNN [24].

In summary, the detection pipeline works as follows: we pass the image through a network to produce a set of per-pixel features. These features are first used to produce heatmaps identifying vertex and edge locations. Individual feature vectors are extracted from the top heatmap locations to predict the appropriate vertex and edge properties. The final result is a pool of vertex and edge detections that together will compose the graph.

3.2 Connecting elements with associative embeddings

Next, the various pieces of the graph need to be put together.
This is made possible by training the network to produce additional outputs in the same step as the class and bounding box prediction. For every vertex, the network produces a unique identifier in the form of a vector embedding, and for every edge, it must produce the corresponding embeddings to refer to its source and destination vertices. The network must learn to ensure that embeddings are different across different vertices, and that all embeddings that refer to a single vertex are the same.

These embeddings are critical for explicitly laying out the definition of a graph. For instance, while it is helpful that edge detections are grounded at the midpoint of two vertices, this ultimately does not address a couple of critical details for correctly constructing the graph. The midpoint does not indicate which vertex serves as the source and which serves as the destination, nor does it disambiguate between pairs of vertices that happen to share the same midpoint.

To train the network to produce a coherent set of embeddings, we build off of the loss penalty used in [20]. During training, we have a ground truth set of annotations defining the unique objects in the scene and the edges between these objects. This allows us to enforce two penalties: that an edge points to a vertex by matching its output embedding as closely as possible, and that the embedding vectors produced for each vertex are sufficiently different. We think of the first as "pulling together" all references to a single vertex, and the second as "pushing apart" the references to different individual vertices.

We consider an embedding h_i ∈ R^d produced for a vertex v_i ∈ V. All edges that connect to this vertex produce a set of embeddings h'_ik, k = 1, ..., K_i, where K_i is the total number of references to that vertex.
Given an image with n objects, the loss to "pull together" these embeddings is:

L_pull = (1 / Σ_{i=1}^{n} K_i) Σ_{i=1}^{n} Σ_{k=1}^{K_i} (h_i − h'_ik)^2

To "push apart" embeddings across different vertices, we first used the penalty described in [20], but experienced difficulty with convergence. We tested alternatives, and the most reliable loss was a margin-based penalty similar to [9]:

L_push = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} max(0, m − ||h_i − h_j||)

Intuitively, L_push is at its highest the closer h_i and h_j are to each other. The penalty drops off sharply as the distance between h_i and h_j grows, eventually hitting zero once the distance is greater than a given margin m. On the flip side, for some edge connected to a vertex v_i, the loss L_pull will quickly grow the further its reference embedding h'_i is from h_i.

The two penalties are weighted equally, leaving a final associative embedding loss of L_pull + L_push. In this work, we use m = 8 and d = 8. Convergence of the network improves greatly after increasing the dimension d of the tags up from 1 as used in [20].

Once the network is trained with this loss, full construction of the graph can be performed with a trivial postprocessing step. The network produces a pool of vertex and edge detections. For every edge, we look at the source and destination embeddings and match them to the closest embedding amongst the detected vertices. Multiple edges may have the same source and target vertices, v_s and v_t, and it is also possible for v_s to equal v_t.

3.3 Support for overlapping detections

In scene graphs, there are going to be many cases where multiple vertices or multiple edges will be grounded at the same pixel location. For example, it is common to see two distinct relationships between a single pair of objects: person wearing shirt and shirt on person.
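Returning for a moment to the penalties of Section 3.2, a rough sketch of the two terms in plain Python (our own function names and data layout, not the paper's implementation, which would operate on tensors):

```python
import math

def pull_loss(vertex_emb, refs):
    """L_pull: mean squared distance between each vertex embedding h_i and
    the embeddings h'_ik produced by edges that refer to it.
    vertex_emb[i] is h_i; refs[i] is the list of reference embeddings h'_ik."""
    total, count = 0.0, 0
    for h, hrefs in zip(vertex_emb, refs):
        for hp in hrefs:
            total += sum((a - b) ** 2 for a, b in zip(h, hp))
            count += 1
    return total / count

def push_loss(vertex_emb, m=8.0):
    """L_push: margin penalty pushing apart embeddings of distinct vertices.
    Pairs farther apart than the margin m contribute nothing."""
    loss = 0.0
    n = len(vertex_emb)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(vertex_emb[i], vertex_emb[j])))
            loss += max(0.0, m - dist)
    return loss
```

With two 1-dimensional embeddings 0.0 and 10.0 and m = 8, the push term is already zero, while the pull term simply averages the squared errors of each edge's references against its vertex.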
The detection pipeline must therefore be extended to support multiple detections at the same pixel.

One way of dealing with this is to define an extra axis that allows for discrete separation of detections at a given x, y location. For example, one could split up objects along a third spatial dimension assuming the z-axis were annotated, or perhaps separate them by bounding box anchors. In either of these cases there is a visual cue guiding the network so that it can learn a consistent rule for assigning new detections to a correct slot in the third dimension. Unfortunately, this idea cannot be applied as easily to relationship detections. It is unclear how to define a third axis such that there is a reliable and consistent bin assignment for each relationship.

In our approach, we still separate detections out into several discrete bins, but address the issue of assignment by not enforcing any specific assignment at all. This means that for a given detection we strictly supervise the x, y location in which it is to appear, but allow it to show up in one of several "slots". We have no way of knowing ahead of time in which slot it will be placed by the network, so an extra step must be taken at training time to identify where we think the network has placed its predictions and then enforce the loss at those slots.

We define s_o and s_r to be the number of slots available for objects and relationships respectively. We modify the network pipeline so that instead of producing predictions for a single object and relationship at a pixel, a feature vector is used to produce predictions for a set of s_o objects and s_r relationships. That is, given a feature vector f from a single pixel, the network will for example output s_o object class labels, s_o bounding box predictions, and s_o embeddings.
This is done with separate fully connected layers predicting the various object and relationship properties for each available slot. No weights are shared amongst these layers. Furthermore, we add an additional output to serve as a score indicating whether or not a detection exists at each slot.

During training, we have some number of ground truth objects, between 1 and s_o, grounded at a particular pixel. We do not know which of the s_o outputs of the network will correspond to which objects, so we must perform a matching step. The network produces distributions across possible object classes and bounding box sizes, so we try to best match the outputs to the ground truth information we have available. We construct a reference vector by concatenating one-hot encodings of the class and bounding box anchor for a given object. Then we compare these reference vectors to the output distributions produced at each slot. The Hungarian method is used to perform a maximum matching step such that ground truth annotations are assigned to the best possible slot, but no two annotations are assigned to the same slot.

Matching for relationships is similar. The ground truth reference vector is constructed by concatenating a one-hot encoding of its class with the output embeddings h_s and h_t from the source and destination vertices, v_s and v_t. Once the best matching has been determined, we have a correspondence between the network predictions and the set of ground truth annotations, and can now apply the various losses. We also supervise the score for each slot depending on whether or not it is matched up to a ground truth detection, thus teaching the network to indicate a "full" or "empty" slot.

This matching process is only used during training.
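To make the matching step concrete, here is a minimal sketch that assigns ground truth reference vectors to distinct output slots. The paper uses the Hungarian method; since the slot counts here are tiny (s_o = 3), this sketch simply brute-forces over assignments, and the scoring function is a stand-in of our own:

```python
import itertools

def match_slots(slot_outputs, gt_refs, score):
    """Assign each ground truth reference to a distinct slot, maximizing the
    total score. Returns {gt_index: slot_index}. Brute force over
    permutations is fine for a handful of slots; the Hungarian method
    scales to larger assignment problems."""
    n_slots, n_gt = len(slot_outputs), len(gt_refs)
    best, best_assign = float("-inf"), None
    for perm in itertools.permutations(range(n_slots), n_gt):
        total = sum(score(slot_outputs[s], gt_refs[g])
                    for g, s in enumerate(perm))
        if total > best:
            best, best_assign = total, perm
    return {g: s for g, s in enumerate(best_assign)}

# Toy example: score a slot against a reference by negative squared distance.
def neg_sq_dist(a, b):
    return -sum((x - y) ** 2 for x, y in zip(a, b))

slots = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # per-slot output distributions
gts = [[0.0, 1.0], [1.0, 0.0]]                  # ground truth reference vectors
assignment = match_slots(slots, gts, neg_sq_dist)
```

Here the first ground truth vector lands in slot 1 and the second in slot 0, and no two annotations can share a slot by construction.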
At test time, we extract object and relationship detections from the network by first thresholding on the heatmaps to find a set of candidate pixel locations, and then thresholding on individual slot scores to see which slots have produced detections.

4 Implementation details

We train a stacked hourglass architecture [21] in TensorFlow [1]. The input to the network is a 512x512 image, with an output resolution of 64x64. To prepare an input image, we resize it so that its largest dimension is of length 512, and center it by padding with zeros along the other dimension. During training, we augment this procedure with random translation and scaling, making sure to update the ground truth annotations to ignore objects and relationships that may be cropped out. We make a slight modification to the original hourglass design: doubling the number of features to 512 at the two lowest resolutions of the hourglass. The output feature length f is 256. All losses (classification, bounding box regression, associative embedding) are weighted equally throughout the course of training. We set s_o = 3 and s_r = 6, which is sufficient to completely accommodate the detection annotations for all but a small fraction of cases.

Figure 3: Predictions on Visual Genome. In the top row, the network must produce all object and relationship detections directly from the image. The second row includes examples from an easier version of the task where object detections are provided. Relationships outlined in green correspond to predictions that correctly matched to a ground truth annotation.

Incorporating prior detections: In some problem settings, a prior set of object detections may be made available, either as ground truth annotations or as proposals from an independent system. It is good to have some way of incorporating these into the network.
We do this by formatting an object\ndetection as a two channel input where one channel consists of a one-hot activation at the center of\nthe object bounding box and the other provides a binary mask of the box. Multiple boxes can be\ndisplayed on these two channels, with the \ufb01rst indicating the center of each box and the second, the\nunion of their masks.\nIf provided with a large set of detections, this representation becomes too crowded so we either\nseparate bounding boxes by object class, or if no class information is available, by bounding box\nanchors. To reduce computational cost this additional input is incorporated after several layers\nof convolution and pooling have been applied to the input image. For example, we set up this\nrepresentation at the output resolution, 64x64, then apply several consecutive 1x1 convolutions to\nremap the detections to a feature tensor with f channels. Then, we add this result to the \ufb01rst feature\ntensor produced by the hourglass network at the same resolution and number of channels.\nSparse supervision: It is important to note that it is almost impossible to exhaustively annotate\nimages for scene graphs. A large number of possible relationships can be described between pairs of\nobjects in a real-world scene. The network is likely to generate many reasonable predictions that are\nnot covered in the ground truth. We want to reduce the penalty associated with these detections and\nencourage the network to produce as many detections as possible. There are a few properties of our\ntraining pipeline that are conducive to this.\nFor example, we do not need to supervise the entire heatmap for object and relationship detections.\nInstead, we apply a loss at the pixels we know correspond to positive detections, and then randomly\nsample some fraction from the rest of the image to serve as negatives. 
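A sketch of this sampling scheme (the sampling ratio, the grid size, and the function name are our choices for illustration, not values specified in the paper):

```python
import random

def sample_loss_pixels(positives, height, width, neg_per_pos=1, rng=None):
    """Pick pixels to supervise on a heatmap: every annotated positive pixel,
    plus a random sample of negatives drawn from the rest of the image.
    Unannotated pixels that are never sampled receive no penalty."""
    rng = rng or random.Random(0)
    pos = set(positives)
    negatives = []
    n_neg = max(1, neg_per_pos * len(pos))
    while len(negatives) < n_neg:
        p = (rng.randrange(width), rng.randrange(height))
        if p not in pos and p not in negatives:
            negatives.append(p)
    return sorted(pos), negatives

pos, neg = sample_loss_pixels([(3, 4), (10, 12)], height=64, width=64)
```

The binary cross-entropy loss would then be applied only at the returned pixels, rather than densely over the whole 64x64 heatmap.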
This balances the proportion of positive and negative samples, and reduces the chance of falsely penalizing unannotated detections.

5 Experiments

Dataset: We evaluate the performance of our method on the Visual Genome dataset [14]. Visual Genome consists of 108,077 images annotated with object detections and object-object relationships, and it serves as a challenging benchmark for scene graph generation on real world images.

Table 1: Results on Visual Genome

                 SGGen (no RPN)    SGGen (w/ RPN)    SGCls            PredCls
                 R@50    R@100     R@50    R@100     R@50    R@100    R@50    R@100
Lu et al. [18]   –       –         0.3     0.5       11.8    14.1     27.9    35.0
Xu et al. [26]   –       –         3.4     4.2       21.7    24.4     44.8    53.0
Our model        6.7     7.8       9.7     11.3      26.5    30.0     68.0    75.2

Table 2: Performance per relationship predicate (top ten on left, bottom ten on right)

Predicate     R@100    Predicate     R@100
wearing       87.3     to            5.5
has           80.4     and           5.4
on            79.3     playing       3.8
wears         77.1     made of       3.2
of            76.1     painted on    2.5
riding        74.1     between       2.3
holding       66.9     against       1.6
in            61.6     flying in     0.0
sitting on    58.4     growing on    0.0
carrying      56.1     from          0.0

Figure 4: How detections are distributed across the six available slots for relationships.

Some processing has to be done before using the dataset, as objects and relationships are annotated with natural language rather than discrete classes, and many redundant bounding box detections are provided for individual objects. To make a direct comparison to prior work, we use the preprocessed version of the set made available by Xu et al. [26]. Their network is trained to predict the 150 most frequent object classes and 50 most frequent relationship predicates in the dataset.
We use the same categories, as well as the same training and test split as defined by the authors.

Task: The scene graph task is defined as the production of a set of subject-predicate-object tuples. A proposed tuple is composed of two objects, defined by their class and bounding box, and the relationship between them. A tuple is correct if the object and relationship classes match those of a ground truth annotation and the two objects have at least a 0.5 IoU overlap with the corresponding ground truth objects. To avoid penalizing extra detections that may be correct but missing an annotation, the standard evaluation metric used for scene graphs is Recall@k, which measures the fraction of ground truth tuples that appear in a set of k proposals. Following [26], we report performance on three problem settings:

SGGen: Detect and classify all objects and determine the relationships between them.
SGCls: Ground truth object boxes are provided; classify them and determine their relationships.
PredCls: Boxes and classes are provided for all objects; predict their relationships.

SGGen corresponds to the full scene graph task, while PredCls allows us to focus exclusively on predicate classification. Example predictions on the SGGen and PredCls tasks are shown in Figure 3. It can be seen in Table 1 that on all three settings, we achieve a significant improvement in performance over prior work. It is worth noting that prior approaches to this problem require a set of object proposal boxes in order to produce their predictions. For the full scene graph task (SGGen) these detections are provided by a Region Proposal Network (RPN) [24]. We evaluate performance with and without the use of RPN boxes, and achieve promising results even without the use of proposal boxes, using nothing but the raw image as input.
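For reference, the Recall@k criterion described above can be sketched as follows (a simplified version with helper names of our own; it glosses over details such as preventing one proposal from matching multiple ground truths):

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def tuple_matches(pred, gt):
    """pred/gt: (subj_cls, subj_box, predicate, obj_cls, obj_box).
    Classes must match exactly; both boxes need IoU >= 0.5."""
    return (pred[0] == gt[0] and pred[2] == gt[2] and pred[3] == gt[3]
            and iou(pred[1], gt[1]) >= 0.5 and iou(pred[4], gt[4]) >= 0.5)

def recall_at_k(proposals, ground_truth, k):
    """Fraction of ground truth tuples matched by any of the top-k proposals."""
    top = proposals[:k]
    hits = sum(any(tuple_matches(p, gt) for p in top) for gt in ground_truth)
    return hits / len(ground_truth)
```

The proposals are assumed to be pre-sorted by confidence, so truncating to the top k mirrors the R@50 and R@100 settings in Table 1.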
Furthermore, the network is trained from scratch, and does not rely on pretraining on other datasets.
Discussion: A few interesting results emerge from our trained model. The network exhibits a number of biases in its predictions. For one, the vast majority of predicate predictions correspond to a small fraction of the 50 predicate classes. Relationships like “on” and “wearing” tend to completely dominate the network output, and this is in large part a function of the distribution of ground truth annotations in Visual Genome: there are several orders of magnitude more examples for “on” than for most other predicate classes. This discrepancy becomes especially apparent when looking at the performance per predicate class in Table 2. The poor results on the worst classes do not have much effect on final performance since there are so few instances of relationships labeled with those predicates.
We do some additional analysis to see how the network fills its “slots” for relationship detection. Recall that at a particular pixel the network produces a set of detections, expressed by filling out a fixed set of available slots. There is no explicit mapping telling the network into which slots it should put particular detections. From Figure 4, we see that the network learns to divide the slots up such that they correspond to subsets of predicates. For example, any detection for the predicates behind, has, in, of, and on will exclusively fall into three of the six available slots. This pattern emerges for most classes, with the exception of wearing/wears, where detections are distributed uniformly across all six slots.

6 Conclusion

The qualities of a graph that allow it to capture so much information about the semantic content of an image come at the cost of additional complexity for any system that wishes to predict them. 
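The slot-usage analysis can be sketched as a simple tally over the model's relationship detections; the `slot_usage` helper and its input format are assumptions for illustration, not our analysis code:

```python
from collections import Counter, defaultdict

# Sketch of the analysis behind Figure 4: for each predicate class, count
# which of the six output slots its relationship detections fall into.
def slot_usage(detections):
    """detections: iterable of (predicate, slot_index) pairs."""
    usage = defaultdict(Counter)
    for predicate, slot in detections:
        usage[predicate][slot] += 1
    return usage
```

A predicate whose counts concentrate in a few slots indicates the network has implicitly specialized those slots for that predicate.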
We show how to supervise a network such that all of the reasoning about a graph can be abstracted away into a single network. The use of associative embeddings and unordered output slots offers the network the flexibility necessary to make training on this task possible. Our results on Visual Genome clearly demonstrate the effectiveness of our approach.

7 Acknowledgements

This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-2015-CRG4-2639.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Yuval Atzmon, Jonathan Berant, Vahid Kezami, Amir Globerson, and Gal Chechik. Learning to generalize to new compositions in image understanding. arXiv preprint arXiv:1608.07639, 2016.

[3] Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. arXiv preprint arXiv:1702.05448, 2017.

[4] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. arXiv preprint arXiv:1704.03114, 2017.

[5] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.

[6] Andrea Frome, Yoram Singer, Fei Sha, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.

[7] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 
Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

[8] Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In European Conference on Computer Vision, pages 529–545. Springer, 2014.

[9] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pages 1735–1742. IEEE, 2006.

[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.

[11] Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referential expressions with compositional modular networks. arXiv preprint arXiv:1611.09978, 2016.

[12] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3668–3678, 2015.

[13] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[14] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.

[15] Yikang Li, Wanli Ouyang, and Xiaogang Wang. Vip-cnn: A visual phrase reasoning convolutional neural network for visual relationship detection. 
arXiv preprint arXiv:1702.07191, 2017.

[16] Xiaodan Liang, Lisa Lee, and Eric P Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. arXiv preprint arXiv:1703.03054, 2017.

[17] Wentong Liao, Michael Ying Yang, Hanno Ackermann, and Bodo Rosenhahn. On support relations and semantic scene graphs. arXiv preprint arXiv:1609.05834, 2016.

[18] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.

[19] Cewu Lu, Hao Su, Yongyi Lu, Li Yi, Chikeung Tang, and Leonidas Guibas. Beyond holistic object recognition: Enriching image understanding with part states. arXiv preprint arXiv:1612.07310, 2016.

[20] Alejandro Newell and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. arXiv preprint arXiv:1611.05424, 2016.

[21] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.

[22] Bryan A Plummer, Arun Mallya, Christopher M Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. Phrase localization and visual relationship detection with comprehensive linguistic cues. arXiv preprint arXiv:1611.06641, 2016.

[23] David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, and Peter Battaglia. Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068, 2017.

[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[25] Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. 
In Advances in neural information processing systems, pages 1473–1480, 2005.

[26] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[27] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. arXiv preprint arXiv:1702.08319, 2017.

[28] Bohan Zhuang, Lingqiao Liu, Chunhua Shen, and Ian Reid. Towards context-aware interaction recognition. arXiv preprint arXiv:1703.06246, 2017.