{"title": "LinkNet: Relational Embedding for Scene Graph", "book": "Advances in Neural Information Processing Systems", "page_first": 560, "page_last": 570, "abstract": "Objects and their relationships are critical contents for image understanding. A scene graph provides a structured description that captures these properties of an image. However, reasoning about the relationships between objects is very challenging and only a few recent works have attempted to solve the problem of generating a scene graph from an image. In this paper, we present a novel method that improves scene graph generation by explicitly modeling inter-dependency among the entire object instances. We design a simple and effective relational embedding module that enables our model to jointly represent connections among all related objects, rather than focus on an object in isolation. Our novel method significantly benefits two main parts of the scene graph generation task: object classification and relationship classification. Using it on top of a basic Faster R-CNN, our model achieves state-of-the-art results on the Visual Genome benchmark. We further push the performance by introducing a global context encoding module and a geometrical layout encoding module. We validate our final model, LinkNet, through extensive ablation studies, demonstrating its efficacy in scene graph generation.", "full_text": "LinkNet: Relational Embedding for Scene Graph\n\nSanghyun Woo\u2217\n\nEE, KAIST\n\nDaejeon, Korea\n\nshwoo93@kaist.ac.kr\n\nDahun Kim\u2217\nEE, KAIST\n\nDaejeon, Korea\n\nmcahny@kaist.ac.kr\n\nIn So Kweon\nEE, KAIST\n\nDaejeon, Korea\n\niskweon@kaist.ac.kr\n\nDonghyeon Cho\n\nEE, KAIST\n\nDaejeon, Korea\n\ncdh12242@gmail.com\n\nAbstract\n\nObjects and their relationships are critical contents for image understanding. A\nscene graph provides a structured description that captures these properties of\nan image. 
However, reasoning about the relationships between objects is very\nchallenging and only a few recent works have attempted to solve the problem of\ngenerating a scene graph from an image. In this paper, we present a method that\nimproves scene graph generation by explicitly modeling inter-dependency among\nthe entire object instances. We design a simple and effective relational embedding\nmodule that enables our model to jointly represent connections among all related\nobjects, rather than focus on an object in isolation. Our method significantly\nbenefits two main parts of the scene graph generation task: object classification\nand relationship classification. Using it on top of a basic Faster R-CNN, our model\nachieves state-of-the-art results on the Visual Genome benchmark. We further push\nthe performance by introducing a global context encoding module and a geometrical\nlayout encoding module. We validate our final model, LinkNet, through extensive\nablation studies, demonstrating its efficacy in scene graph generation.\n\n1\n\nIntroduction\n\nCurrent state-of-the-art recognition models have made significant progress in detecting individual\nobjects in isolation [9, 20]. However, we are still far from reaching the goal of capturing the\ninteractions and relationships between these objects. While objects are the core elements of an\nimage, it is often the relationships that determine the global interpretation of the scene. A deeper\nunderstanding of a visual scene can be realized by building a structured representation which captures\nobjects and their relationships jointly. Being able to extract such graph representations has been\nshown to benefit various high-level vision tasks such as image search [13], question answering [2],\nand 3D scene synthesis [29].\nIn this paper, we address scene graph generation, where the objective is to build a visually-grounded\nscene graph of a given image. 
In a scene graph, objects are represented as nodes and relationships\nbetween them as directed edges. In practice, a node is characterized by an object bounding box\nwith a category label, and an edge is characterized by a predicate label that connects two nodes as a\nsubject-predicate-object triplet. As such, a scene graph is able to model not only what objects are in\nthe scene, but also how they relate to each other.\n\n\u2217Both authors contributed equally\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe key challenge in this task is to reason about inter-object relationships. We hypothesize that\nexplicitly modeling inter-dependency among the entire set of object instances can improve a model\u2019s\nability to infer their pairwise relationships. Therefore, we propose a simple and effective relational\nembedding module that enables our model to jointly represent connections among all related objects,\nrather than focus on an object in isolation. This significantly benefits the main part of the scene graph\ngeneration task: relationship classification.\nWe further improve our network by introducing a global context encoding module and a geometrical\nlayout encoding module. It is well known that fusing global and local information plays an important\nrole in numerous visual tasks [8, 23, 39, 6, 40, 36]. Motivated by these works, we build a module\nthat can provide contextual information. In particular, the module consists of global average pooling\nand binary sigmoid classifiers, and is trained for multi-label object classification. This encourages\nits intermediate features to represent all object categories present in an image, and supports our full\nmodel. 
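As a rough sketch (not the authors' implementation; numpy with made-up feature and class dimensions), the module just described amounts to global average pooling followed by independent per-class sigmoid classifiers:

```python
import numpy as np

def global_context_encoding(feature_map, W):
    """Global average pooling followed by per-class sigmoid scores.

    feature_map: (C, H, W) convolutional features; W: (C, num_classes)
    linear classifier weights. Returns the pooled context vector and
    independent multi-label probabilities.
    """
    c = feature_map.mean(axis=(1, 2))        # global average pool -> (C,)
    logits = c @ W                           # one linear layer
    probs = 1.0 / (1.0 + np.exp(-logits))    # binary sigmoid per class
    return c, probs

rng = np.random.default_rng(0)
ctx, probs = global_context_encoding(rng.normal(size=(512, 7, 7)),
                                     0.01 * rng.normal(size=(512, 150)))
```

The pooled vector ctx would then be concatenated with each object's features, while probs is supervised with the multi-label objective.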
Also, for the geometrical layout encoding module, we derive inspiration from the fact that\nmost relationships in general are spatially regularized, implying that the subject-object relative geometric\nlayout can be a powerful cue for inferring the relationship between them. Our novel architecture\nresults in our final model LinkNet, of which the overall architecture is illustrated in Fig. 1.\nOn the Visual Genome dataset, LinkNet obtains state-of-the-art results in scene graph generation\ntasks, revealing the efficacy of our approach. We visualize the weight matrices in the relational embedding\nmodule and observe that the inter-dependency between objects is indeed represented (see Fig. 2).\n\nContribution. Our main contributions are three-fold.\n\n1. We propose a simple and effective relational embedding module in order to explicitly model\ninter-dependency among all objects in an image. The relational embedding module\nimproves the overall performance significantly.\n\n2. In addition, we introduce a global context encoding module and a geometrical layout encoding\n\nmodule for more accurate scene graph generation.\n\n3. The final network, LinkNet, has achieved new state-of-the-art performance in scene graph\ngeneration tasks on the large-scale benchmark [16]. Extensive ablation studies demonstrate\nthe effectiveness of the proposed network.\n\n2 Related Work\n\nRelational Reasoning Relational reasoning has been explicitly modeled and adopted in neural\nnetworks. In the early days, most works attempted to apply neural networks to graphs, which are a\nnatural structure for defining relations [11, 15, 24, 28, 1, 32]. Recently, more efficient relational\nreasoning modules have been proposed [27, 30, 31]. These can model dependencies between\nelements even with non-graphical inputs, aggregating information from the feature embeddings at\nall pairs of positions in the input (e.g., pixels or words). 
The aggregation weights are learned\nautomatically, driven by the target task. While our work is connected to these previous works, an apparent\ndistinction is that we consider object instances instead of pixels or words as our primitive elements.\nSince the objects have variations in scale/aspect ratio, we use the RoI-align operation [9] to generate\nfixed 1D representations, easing the subsequent relation computations.\nMoreover, the relational reasoning of our model has a link to attentional graph neural networks. Similar\nto ours, Chen et al. [4] use a graph to encode spatial and semantic relations between regions\nand classes and pass information among them. To do so, they build a commonsense knowledge\ngraph (i.e., an adjacency matrix) from relationship annotations in the training set. However, our approach does\nnot require any external knowledge sources for training. Instead, the proposed model generates\na soft version of the adjacency matrix (see Fig. 2) on-the-fly by capturing the inter-dependency among the\nentire set of object instances.\n\nRelationship Detection The task of recognizing objects and their relationships has been investigated\nby numerous studies in various forms. This includes detection of human-object interactions [7, 3],\nlocalization of proposals from natural language expressions [12], and the more general tasks of visual\nrelationship detection [17, 25, 38, 5, 19, 37, 34, 41] and scene graph generation [33, 18, 35, 22].\nAmong them, the scene graph generation problem has recently drawn much attention. The challenging\nand open-ended nature of the task lends itself to a variety of diverse methods. 
For example: fixing\nthe structure of the graph, then refining node and edge labels using iterative message passing [33];\nutilizing associative embedding to simultaneously identify nodes and edges of the graph and piece them\ntogether [22]; extending the idea of message passing from [33] with an additional RPN in order to\npropose regions for captioning and solve both tasks jointly [18]; and staging the inference process in three steps\nbased on the finding that object labels are highly predictive of relation labels [35].\nIn this work, we utilize relational embedding for scene graph generation. It uses a basic self-attention\nmechanism [30] to compute the aggregation weights. Compared to previous models [33, 18]\nthat have been proposed to focus on message passing between nodes and edges, our model explicitly\nreasons about the relations within nodes and edges and predicts graph elements in multiple steps [35],\nsuch that features of a previous stage provide rich context to the next stage.\n\n3 Proposed Approach\n\n3.1 Problem Definition\n\nA scene graph is a topological representation of a scene, which encodes object instances, corresponding\nobject categories, and relationships between the objects. The task of scene graph generation is to\nconstruct a scene graph that best associates its nodes and edges with the objects and the relationships\nin an image, respectively.\nFormally, the graph contains a node set V and an edge set E. Each node vi is represented by a\nbounding box vbbox_i \u2208 R4 and a corresponding object class vcls_i \u2208 Cobj. Each edge ei\u2192j \u2208 Crel\ndefines a relationship predicate between the subject node vi and the object node vj. Cobj is a set of\nobject classes, and Crel is a set of relationships. 
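To make the definition concrete, a minimal in-memory representation of (V, E) might look as follows (hypothetical class and label names, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Node:
    bbox: tuple  # v_bbox in R4: (x, y, w, h)
    cls: str     # v_cls, an element of Cobj

# node set V: object instances with boxes and category labels
V = [Node((10, 20, 50, 80), "man"),
     Node((15, 90, 30, 20), "skateboard")]

# edge set E: directed (subject index, predicate in Crel, object index)
E = [(0, "riding", 1)]

# each edge yields a subject-predicate-object triplet
triplets = [(V[i].cls, p, V[j].cls) for (i, p, j) in E]
```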
At a high level, the inference task is to classify objects,\npredict their bounding box coordinates, and classify pairwise relationship predicates between objects.\n\n3.2 LinkNet\n\nAn overview of LinkNet is shown in Fig. 1. To generate a visually grounded scene graph, we need to\nstart with an initial set of object bounding boxes, which can be obtained from ground-truth human\nannotation or generated algorithmically. Either case is somewhat straightforward; in practice, we\nuse a standard object detector, Faster R-CNN [26], as our bounding box model (Pr(V bbox|I)). Given\nan image I, the detector predicts a set of region proposals V bbox. For each proposal vbbox_i, it also\noutputs a RoI-align feature vector f RoI_i and an object label distribution li.\nWe build upon these initial object features f RoI_i, li and design a novel scene graph generation network\nthat consists of three modules. The first module is a relational embedding module that explicitly\nmodels the inter-dependency among all the object instances. This significantly improves relationship\nclassification (Pr(Ei\u2192j|I, V bbox, V cls)). Second, a global context encoding module provides our\nmodel with contextual information. Finally, the performance of predicate classification is further\nboosted by our geometric layout encoding.\nIn the following subsections, we will explain how each proposed module is used in the two main steps\nof scene graph generation: object classification and relationship classification.\n\n3.3 Object Classification\n\n3.3.1 Object-Relational Embedding\nFor each region proposal, we construct a relation-based representation by utilizing the object features\nfrom the underlying RPN: the RoI-aligned feature f RoI_i \u2208 R4096 and the embedded object label distribution\nK0li \u2208 R200. K0 denotes a parameter matrix that maps the distribution of predicted classes, li,\nto R200. 
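A sketch of how the per-object inputs could be assembled (numpy with random stand-ins; the 151-way label distribution is an illustrative choice, and the global context vector c of Sec. 3.3.2 is included as in the next paragraph, giving 4096 + 200 + 512 = 4808 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                    # number of object proposals

f_roi = rng.normal(size=(N, 4096))       # RoI-aligned visual features f_RoI
l = rng.random(size=(N, 151))            # predicted label distributions l_i
K0 = 0.01 * rng.normal(size=(151, 200))  # label-embedding matrix K0
c = rng.normal(size=(512,))              # shared image-level context vector

# o_i = (f_RoI_i, K0 l_i, c), stacked into O0 of shape (N, 4808)
O0 = np.concatenate([f_roi, l @ K0, np.tile(c, (N, 1))], axis=1)
```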
In practice, we use an additional image-level context feature c \u2208 R512, so that each object\nproposal is finally represented as a concatenated vector oi = (f RoI_i, K0li, c). We detail the global\ncontext encoding in Sec. 3.3.2.\nThen, for a given image, we obtain N object proposal features oi, i = 1, ..., N. Here, we consider\nan object-relational embedding R that computes the response for one object region oi by attending to the\nfeatures from all N object regions. This is inspired by recent works on relational reasoning [27,\n30, 31]. Despite the connection, what makes our work distinctive is that we consider object-level\ninstances as our primitive elements, whereas the previous methods operate on pixels [27, 31] or\nwords [30].\n\nFigure 1: The overview of LinkNet. The model predicts a graph in three steps: bounding box proposal, object\nclassification, and relationship classification. The model consists of three modules: a global context encoding\nmodule, a relational embedding module, and a geometric layout encoding module. Best viewed in color.\n\nIn practice, we stack all the object proposal features to build a matrix O0 \u2208 RN\u00d74808, from which\nwe can compute a relational embedding matrix R1 \u2208 RN\u00d7N. Then, the relation-aware embedded\nfeatures O2 \u2208 RN\u00d7256 are computed as:\n\nR1 = softmax((O0W1)(O0U1)T) \u2208 RN\u00d7N, (1)\nO1 = O0 \u2295 fc0(R1(O0H1)) \u2208 RN\u00d74808, (2)\nO2 = fc1(O1) \u2208 RN\u00d7256, (3)\n\nwhere W1, U1 and H1 are parameter matrices that map the object features O0 to RN\u00d7(4808/r); we\nfound that setting the hyper-parameter r to 2 produces the best results in our experiments. The softmax operation\nis conducted row-wise, constructing an embedding matrix. fc0 and fc1 are parameter matrices\nthat map their input features from RN\u00d7(4808/r) to RN\u00d74808, and from RN\u00d74808 to an embedding space RN\u00d7256,\nrespectively. 
\u2295 denotes an element-wise summation, allowing efficient training overall due to the\nresidual learning mechanism [10]. The resulting feature O2 again goes through a similar relational\nembedding process, and is eventually embedded into an object label distribution O4 \u2208 RN\u00d7Cobj as:\n\nR2 = softmax((O2W2)(O2U2)T) \u2208 RN\u00d7N, (4)\nO3 = O2 \u2295 fc2(R2(O2H2)) \u2208 RN\u00d7256, (5)\nO4 = fc3(O3) \u2208 RN\u00d7Cobj, (6)\n\nwhere W2, U2 and H2 map the object features O2 to RN\u00d7(256/r). The softmax operation is conducted\nrow-wise, same as above. fc2 and fc3 are further parameter matrices that map the intermediate\nfeatures from RN\u00d7(256/r) to RN\u00d7256, and from RN\u00d7256 to RN\u00d7Cobj, respectively. Finally, the Cobj-way\nobject classification Pr(V cls|I, V bbox) is optimized on the resulting feature O4 as:\n\n\u02c6V cls = O4, (7)\nLobj_cls = \u2212 \u2211 V cls log(\u02c6V cls). (8)\n\n3.3.2 Global Context Encoding\n\nHere we describe the global context encoding module in detail. This module is designed with the\nintuition that knowing contextual information a priori may help in inferring individual objects in the\nscene.\nIn practice, we introduce an auxiliary task of multi-label classification, so that the intermediate\nfeatures c can encode all kinds of objects present in an image. 
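The object-relational embedding of Eqs. (1)-(3) can be sketched compactly (numpy; small stand-in dimensions, and random matrices in place of the learned parameters W1, U1, H1, fc0, fc1, with biases omitted):

```python
import numpy as np

def softmax_rows(x):
    # row-wise softmax: each row of weights sums to 1
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def relational_embedding(O0, W1, U1, H1, fc0, fc1):
    # Eq. (1): R1 = softmax((O0 W1)(O0 U1)^T), an N x N relation matrix
    R1 = softmax_rows((O0 @ W1) @ (O0 @ U1).T)
    # Eq. (2): O1 = O0 + fc0(R1 (O0 H1)), residual aggregation over objects
    O1 = O0 + (R1 @ (O0 @ H1)) @ fc0
    # Eq. (3): O2 = fc1(O1), projection into the embedding space
    return O1 @ fc1, R1

N, d, r, e = 4, 64, 2, 16              # toy sizes (paper: d = 4808, e = 256)
rng = np.random.default_rng(0)
O2, R1 = relational_embedding(
    rng.normal(size=(N, d)),
    0.01 * rng.normal(size=(d, d // r)),   # W1
    0.01 * rng.normal(size=(d, d // r)),   # U1
    0.01 * rng.normal(size=(d, d // r)),   # H1
    0.01 * rng.normal(size=(d // r, d)),   # fc0
    0.01 * rng.normal(size=(d, e)))        # fc1
```

The second stage, Eqs. (4)-(6), repeats the same pattern on O2 with its own parameters.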
More specifically, the global context\nencoding c \u2208 R512 is taken from an average pooling over the RPN image features (R512\u00d7H\u00d7W), as\nshown in Fig. 1. This feature c is concatenated with the initial object features (fi, K0li) as explained\nin Sec. 3.3.1, and supports scene graph generation performance, as we will demonstrate in Sec. 4.2.\nAfter one parameter matrix, c becomes a multi-label distribution \u02c6M \u2208 (0, 1)Cobj, and multi-label\nobject classification (gce loss) is optimized on the ground-truth labels M \u2208 [0, 1]Cobj as:\n\nLgce = \u2212 \u2211(c=1 to Cobj) Mc log(\u02c6Mc). (9)\n\n3.4 Relationship Classification\n\n3.4.1 Edge-Relational Embedding\n\nAfter the object classification, we further construct relation-based representations suitable for relationship\nclassification. For this, we apply another sequence of relational embedding modules. In\nparticular, the output of the previous object-relational embedding module O4 \u2208 RN\u00d7Cobj and the\nintermediate feature O3 \u2208 RN\u00d7256 are taken as inputs:\n\nO'4 = argmax(O4) \u2208 RN\u00d7Cobj, (10)\nE0 = (K1O'4, O3) \u2208 RN\u00d7(200+256), (11)\n\nwhere the argmax is conducted row-wise and produces a one-hot encoded vector O'4, which is\nthen mapped into RN\u00d7200 by a parameter matrix K1. Then, embedding operations similar to those in\nSec. 3.3.1 are applied on E0, resulting in embedded features E1 \u2208 RN\u00d78192, where half of the\nchannels (4096) refer to subject edge features and the other half to object edge features (see Fig. 1).\nFor each of the N2 \u2212 N possible edges, say between vi and vj, we compute the probability that the edge will\nhave label ei\u2192j (including the background). 
We operate on E1 and embedded features from the\nunion region of the i-th and j-th object regions, F = { fi,j | i \u2208 (1, 2, ..., N), j \u2208 (1, 2, ..., N), j \u2260 i }\n\u2208 RN(N\u22121)\u00d74096, as:\n\nG0ij = (E1si \u2297 E1oj \u2297 Fij) \u2208 R4096, (12)\nG1 = (G0, K2(bo|s)) \u2208 RN(N\u22121)\u00d7(4096+128), (13)\nG2 = fc4(G1) \u2208 RN(N\u22121)\u00d7Crel. (14)\n\nWe combine the subject edge features, object edge features and union image representations by a low-rank\nouter product [14]. bo|s denotes the relative geometric layout, which is detailed in Sec. 3.4.2.\nIt is embedded into RN(N\u22121)\u00d7128 by a parameter matrix K2. A parameter matrix fc4 maps the\nintermediate features G1 \u2208 RN(N\u22121)\u00d74224 into G2 \u2208 RN(N\u22121)\u00d7Crel.\n\nMethods              | Predicate Classification | Scene Graph Classification | Scene Graph Detection\n                     | R@20  R@50  R@100        | R@20  R@50  R@100          | R@20  R@50  R@100\nVRD [21]             | -     27.9  35.0         | -     11.8  14.1           | -     0.3   0.5\nMESSAGE PASSING [33] | -     44.8  53.0         | -     21.7  24.4           | -     3.4   4.2\nASSOC EMBED [22]     | 47.9  54.1  55.4         | 18.2  21.8  22.6           | 6.5   8.1   8.2\nMOTIFNET [35]        | 58.5  65.2  67.1         | 32.9  35.8  36.5           | 21.4  27.2  30.3\nLinkNet              | 61.8  67.0  68.5         | 38.3  41.0  41.7           | 22.3  27.4  30.1\n\nTable 1: The table shows that our model achieves state-of-the-art results on the Visual Genome benchmark [16]. Note\nthat the Predicate Classification and Scene Graph Classification tasks assume exactly the same perfect detector\nacross the methods, and evaluate how well each model predicts object labels and their relations, while the Scene\nGraph Detection task takes a customized pre-trained detector and performs the subsequent tasks.\n\nFinally, the Crel-way relationship classification Pr(Ei\u2192j|I, V bbox, V cls) is optimized on the resulting\nfeature G2 as:\n\n\u02c6Ei\u2192j = G2, (15)\nLrel_cls = \u2212 \u2211(i=1 to N) \u2211(j\u2260i) Ei\u2192j log(\u02c6Ei\u2192j). (16)\n\n3.4.2 Geometric Layout Encoding\nWe hypothesize that the relative geometry between the subject and object is a powerful cue for inferring\nthe relationship between them. Indeed, many predicates have a straightforward correlation with the\nsubject-object relative geometry, whether they are geometric (e.g., \u2019behind\u2019), possessive (e.g., \u2019has\u2019),\nor semantic (e.g., \u2019riding\u2019).\nTo exploit this cue, we encode the relative location and scale information as:\n\nbo|s = ((xo \u2212 xs)/ws, (yo \u2212 ys)/hs, log(wo/ws), log(ho/hs)), (17)\n\nwhere x, y, w, and h denote the x,y-coordinates, width, and height of the object proposal, and the\nsubscripts o and s denote object and subject, respectively. We embed bo|s to a feature in RN\u00d7128 and\nconcatenate this with the subject-object features as in Eq. (13).\n\n3.5 Loss\n\nThe whole network can be trained in an end-to-end manner, allowing the network to predict object\nbounding boxes, object categories, and relationship categories sequentially (see Fig. 1). 
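Eq. (17) translates directly into code; a small sketch (plain Python, with hypothetical box values in (x, y, w, h) form):

```python
import math

def layout_encoding(subject, obj):
    """Relative geometric layout b_{o|s} of a subject-object box pair."""
    xs, ys, ws, hs = subject
    xo, yo, wo, ho = obj
    return ((xo - xs) / ws,      # horizontal offset, in subject widths
            (yo - ys) / hs,      # vertical offset, in subject heights
            math.log(wo / ws),   # log width ratio
            math.log(ho / hs))   # log height ratio

# an object twice the subject's size, shifted right by one subject-width
b = layout_encoding((10.0, 10.0, 20.0, 40.0), (30.0, 10.0, 40.0, 80.0))
```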
Our loss\nfunction for an image is defined as:\n\nLfinal = Lobj_cls + \u03bb1 Lrel_cls + \u03bb2 Lgce. (18)\n\nBy default, we set \u03bb1 and \u03bb2 to 1, and thus all the terms are equally weighted.\n\n4 Experiments\n\nWe conduct experiments on the Visual Genome benchmark [16].\n\n4.1 Quantitative Evaluation\n\nSince current work in scene graph generation is largely inconsistent in terms of data splitting and\nevaluation, we compare against papers [21, 33, 22, 35] that followed the original work [33]. The\nexperimental results are summarized in Table 1.\nLinkNet achieves new state-of-the-art results on the Visual Genome benchmark [16], demonstrating\nits efficacy in identifying and associating objects. For the scene graph classification and predicate\nclassification tasks, our model outperforms the strong baseline [35] by a large margin. Note that the\npredicate classification and scene graph classification tasks assume the same perfect detector across\nthe methods, whereas the scene graph detection task depends on a customized pre-trained detector.\n\nIndependent Variables   Value     Scene Graph Classification\n                                  R@20   R@50   R@100\nNumber of REM           1         37.7   40.4   41\n                        ours(2)   38.3   41     41.7\n                        3         37.9   40.6   41.3\n                        4         38     40.7   41.4\nReduction ratio (r)     1         38     40.9   41.6\n                        ours(2)   38.3   41     41.7\n                        4         38.2   41     41.6\n                        8         37.7   40.5   41.2\n(a) Experiments on hyperparams.\n\nExp    Operation                  Scene Graph Classification\n       argmax(O4)   concat(O3)   R@20   R@50   R@100\n1      \u2713                         37.3   39.8   40.6\n2                   \u2713            38     40.7   41.4\nOurs   \u2713            \u2713            38.3   41     41.7\n(b) Design-choices in constructing E0.\n\nTable 2: (a) includes experiments for the optimal values of the two hyper-parameters; (b) includes experiments\nto verify the effective design choices in constructing E0.\n\nExp    REM  GLEM  GCEM   Softmax  Sigmoid   Dot prod  Eucli   Scene Graph Classification\n                         (Row-wise)         (Similarity)      R@20   R@50   R@100\n1      \u2713                 \u2713                  \u2713                 37.4   40.0   40.8\n2      \u2713    \u2713            \u2713                  \u2713                 37.9   40.4   41.2\n3      \u2713          \u2713      \u2713                  \u2713                 38.0   40.6   41.3\n4      \u2713    \u2713     \u2713               \u2713         \u2713                 37.7   40.3   41\n5      \u2713                 \u2713                            \u2713       37.2   40.0   40.7\n6      \u2713    \u2713     \u2713      \u2713                            \u2713       37.9   40.7   41.4\nOurs   \u2713    \u2713     \u2713      \u2713                  \u2713                 38.3   41     41.7\n\npredicate      R@100 w. GLEM   R@100 w.o GLEM\nusing          0.269           0.000\ncarrying       0.246           0.118\nriding         0.249           0.138\nbehind         0.341           0.287\nat             0.072           0.040\nin front of    0.094           0.069\nhas            0.495           0.473\nwearing        0.488           0.468\non             0.570           0.551\nsitting on     0.088           0.070\n\nTable 3: The left table shows ablation studies on the final model. The right table summarizes the top-10 predicates\nwith the highest recall increase in scene graph classification with the use of the geometric layout encoding module.\nREM, GLEM, and GCEM denote the Relational Embedding Module, Geometric Layout Encoding Module, and\nGlobal Context Encoding Module, respectively.\n\n4.2 Ablation Study\n\nIn order to evaluate the effectiveness of our model, we conduct four ablation studies based on the\nscene graph classification task as follows. The results of the ablation studies are summarized in Table 2\nand Table 3.\n\nExperiments on hyperparameters. The first row of Table 2a shows the results of using more relational\nembedding modules. We argue that multiple modules can perform multi-hop communication: messages\nbetween all the objects can be effectively propagated, which is hard to do via standard models.\nHowever, too many modules can cause optimization difficulty. Our model with two REMs achieved\nthe best results. In the second row of Table 2a, we compare performance with four different reduction\nratios. The reduction ratio determines the number of channels in the module, which enables us to\ncontrol the capacity and overhead of the module. 
The reduction ratio 2 achieves the best accuracy,\neven though the reduction ratio 1 allows higher model capacity. We attribute this to over-fitting, since\nthe training losses converged in both cases. Overall, the performance drops off smoothly across the\nreduction ratios, demonstrating that our approach is robust to this hyper-parameter.\n\nDesign-choices in constructing E0. Here we construct the input (E0) of the edge-relational embedding\nmodule by combining an object class representation (O'4) and a global contextual representation (O3).\nThe operations are inspired by the recent finding [35] that contextual information is critical for the\nrelationship classification of an object. To do so, we turn O4, the object label probabilities, into one-hot\nvectors via an argmax operation (committing to a specific object class label), and we concatenate it with\nthe output (O3) that passed through the relational embedding module (a contextualized representation).\nAs shown in Table 2b, we empirically confirm that both operations contribute to the performance\nboost.\n\nThe effectiveness of the proposed modules. We perform an ablation study to validate the modules\nin the network, which are the relational embedding module, the geometric layout encoding module, and\nthe global context encoding module. We remove each module to verify the effectiveness of utilizing\nall the proposed modules. As shown in Exp 1, 2, 3, and Ours, we can clearly see the performance\nimprovement when we use all the modules jointly. This shows that the modules together play a critical\nrole in inferring object labels and their relationships. Note that Exp 1 already achieves\n\nFigure 2: Visualization of relational embedding matrices. For each example, the first row shows ground-truth\nobject regions. The left and right columns of the second row show the ground-truth relations (binary, 1 if present, 0\notherwise), and the weights of our relational embedding matrix, respectively. 
Note how the relational embedding\nrelates the objects with a real connection, compared to those with no relationship.\n\nstate-of-the-art results, showing that utilizing the relational embedding module is crucial, while the other\nmodules further boost performance.\n\nThe effectiveness of GLEM. We conduct an additional analysis to see how the network performs with\nthe use of the geometric layout encoding module. We select the top-10 predicates with the highest recall increase\nin the scene graph classification task. As shown on the right side of Table 3, we empirically confirm that the\nrecall values of geometrically related predicates, such as using, carrying, and\nriding, are significantly increased. In other words, the module helps in predicting predicates that have a\nclear subject-object relative geometry.\n\nRow-wise operation methods.\nIn this experiment, we conduct an ablation study to compare row-wise\noperation methods in the relational embedding matrix: softmax and sigmoid. As we can see in Exp\n4 and Ours, the softmax operation, which imposes competition along the row dimension, performs better,\nimplying that an explicit attention mechanism [30] that emphasizes or suppresses relations between\nobjects helps to build a more informative embedding matrix.\n\nRelation computation methods.\nIn this experiment, we investigate two commonly used relation\ncomputation methods: dot product and euclidean distance. As shown in Exp 1 and 5, we observe that\nthe dot product produces slightly better results, indicating that the relational embedding behavior is crucial\nfor the improvement, while it is less sensitive to the computation method. 
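The competition effect of the row-wise softmax discussed above can be seen in a toy example (numpy; illustrative scores only):

```python
import numpy as np

scores = np.array([[2.0, 1.0, 0.0]])  # raw relation scores for one object row

# row-wise softmax: weights compete and sum to 1 within the row
softmax_w = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# row-wise sigmoid: each weight is scored independently of the others
sigmoid_w = 1.0 / (1.0 + np.exp(-scores))
```

Under softmax, raising one relation's weight necessarily suppresses the others in the same row, which is the attention-like behavior the ablation favors.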
Meanwhile, Exp 5 and 6\nshow that even when we use the euclidean distance method, the geometric layout encoding module and the global\ncontext encoding module further improve the overall performance, again showing the efficacy of the\nintroduced modules.\n\n4.3 Qualitative Evaluation\n\nVisualization of relational embedding matrix. We visualize the relational embedding of our\nnetwork in Fig. 2. For each example, the bottom-left is the ground-truth binary triangular matrix,\nwhose entry is filled as: (i, j | i < j) = 1 only if there is a non-background relationship (in any\ndirection) between the i-th and j-th instances, and 0 otherwise. The bottom-right is the trained weights\nof an intermediate relational embedding matrix (Eq. (4)), folded into a triangular form. The results\nshow that our relational embedding represents the inter-dependency among all object instances, being\nconsistent with the ground-truth relationships. To illustrate, in the first example, the ground-truth\nmatrix refers to the relationships between the \u2019man\u2019 (1) and his body parts (2, 3), and between the \u2019mountain\u2019 (0)\nand the \u2019rocks\u2019 (4, 5, 6, 7), which are also reasonably captured in our relational embedding matrix. Note\nthat our model infers relationships correctly even when there are missing ground-truths, such as cell (7, 0),\ndue to the sparsity of annotations in the Visual Genome dataset. 
Indeed, our relational embedding module