{"title": "Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts", "book": "Advances in Neural Information Processing Systems", "page_first": 1912, "page_last": 1922, "abstract": "Textual grounding is an important but challenging task for human-computer inter- action, robotics and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding box proposals obtained from deep net based systems. In this work, we demonstrate that we can cast the problem of textual grounding into a unified framework that permits efficient search over all possible bounding boxes. Hence, the method is able to consider significantly more proposals and doesn\u2019t rely on a successful first stage hypothesizing bounding box proposals. Beyond, we demonstrate that the trained parameters of our model can be used as word-embeddings which capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the current state-of-the-art methods on the Flickr 30k Entities and the ReferItGame dataset by 3.08% and 7.77% respectively.", "full_text": "Interpretable and Globally Optimal Prediction for\n\nTextual Grounding using Image Concepts\n\nRaymond A. Yeh,\n\nJinjun Xiong\u2020, Wen-mei W. Hwu,\n\nMinh N. Do, Alexander G. Schwing\n\nDepartment of Electrical Engineering, University of Illinois at Urbana-Champaign\n\n\u2020IBM Thomas J. Watson Research Center\n\nyeh17@illinois.edu, jinjun@us.ibm.com, w-hwu@illinois.edu,\n\nminhdo@illinois.edu, aschwing@illinois.edu\n\nAbstract\n\nTextual grounding is an important but challenging task for human-computer inter-\naction, robotics and knowledge mining. Existing algorithms generally formulate\nthe task as selection from a set of bounding box proposals obtained from deep\nnet based systems. 
In this work, we demonstrate that we can cast the problem of textual grounding into a uni\ufb01ed framework that permits ef\ufb01cient search over all possible bounding boxes. Hence, the method is able to consider signi\ufb01cantly more proposals and doesn\u2019t rely on a successful \ufb01rst stage hypothesizing bounding box proposals. Moreover, we demonstrate that the trained parameters of our model can be used as word-embeddings which capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the current state-of-the-art methods on the Flickr 30k Entities and the ReferItGame dataset by 3.08% and 7.77% respectively.\n\n1\n\nIntroduction\n\nGrounding of textual phrases, i.e., \ufb01nding bounding boxes in images which relate to textual phrases, is an important problem for human-computer interaction, robotics and mining of knowledge bases, three applications that are of increasing importance when considering autonomous systems, augmented and virtual reality environments. For example, we may want to guide an autonomous system by using phrases such as \u2018the bottle on your left,\u2019 or \u2018the plate in the top shelf.\u2019 While those phrases are easy to interpret for a human, they pose signi\ufb01cant challenges for present day textual grounding algorithms, as interpretation of those phrases requires an understanding of objects and their relations.\nExisting approaches for textual grounding, such as [38, 15], take advantage of the cognitive performance improvements obtained from deep net features. More speci\ufb01cally, deep net models are designed to extract features from given bounding boxes and textual data, which are then compared to measure their \ufb01tness. To obtain suitable bounding boxes, many of the textual grounding frameworks, such as [38, 15], make use of region proposals. 
While being easy to obtain, automatic extraction of region proposals is limiting, because the performance of the visual grounding is inherently constrained by the quality of the proposal generation procedure.\nIn this work we describe an interpretable mechanism which additionally alleviates any issues arising due to a limited number of region proposals. Our approach is based on a number of \u2018image concepts\u2019 such as semantic segmentations, detections and priors for any number of objects of interest. Based on those \u2018image concepts\u2019 which are represented as score maps, we formulate textual grounding as a search over all possible bounding boxes. We \ufb01nd the bounding box with the highest accumulated score contained in its interior.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fA woman in a green shirt is getting ready to throw her bowling ball down the lane...\n\nTwo women wearing hats covered in \ufb02owers are posing.\n\nYoung man wearing a hooded jacket sitting on snow in front of mountain area.\n\nsecond bike from right in front\n\npainting next to the two on the left\n\nperson all the way to the right\n\nFigure 1: Results on the test set for grounding of textual phrases using our branch and bound based algorithm. Top Row: Flickr 30k Entities Dataset. Bottom Row: ReferItGame Dataset (Groundtruth box in green and predicted box in red).\n\nThe search for this box can be solved via an ef\ufb01cient branch and bound scheme akin to the seminal ef\ufb01cient subwindow search of Lampert et al. [25]. The learned weights can additionally be used as word embeddings. 
We are not aware of any method that solves textual\ngrounding in a manner similar to our approach and hope to inspire future research into the direction\nof deep nets combined with powerful inference algorithms.\nWe evaluate our proposed approach on the challenging ReferItGame [20] and the Flickr 30k Entities\ndataset [35], obtaining results like the ones visualized in Fig. 1. At the time of submission, our\napproach outperformed state-of-the-art techniques on the ReferItGame and Flickr 30k Entities dataset\nby 7.77% and 3.08% respectively using the IoU metric. We also demonstrate that the trained\nparameters of our model can be used as a word-embedding which captures spatial-image relationships\nand provides interpretability.\n\n2 Related Work\n\nTextual grounding: Related to textual grounding is work on image retrieval. Classical approaches\nlearn a ranking function using recurrent neural nets [30, 6], or metric learning [13], correlation\nanalysis [22], and neural net embeddings [9, 21]. Beyond work in image retrieval, a variety of\ntechniques have been considered to explicitly ground natural language in images and video. One of\nthe \ufb01rst models in this area was presented in [31, 24]. The authors describe an approach that jointly\nlearns visual classi\ufb01ers and semantic parsers.\nGong et al. [10] propose a canonical correlation analysis technique to associate images with descrip-\ntive sentences using a latent embedding space. In spirit similar is work by Wang et al. [42], which\nlearns a structure-preserving embedding for image-sentence retrieval. It can be applied to phrase\nlocalization using a ranking framework. In [11], text is generated for a set of candidate object regions\nwhich is subsequently compared to a query. 
The reverse operation, i.e., generating visual features from query text, which are subsequently matched to image regions, is discussed in [1].\nIn [23], 3D cuboids are aligned to a set of 21 nouns relevant to indoor scenes using a Markov random \ufb01eld based technique. A method for grounding of scene graph queries in images is presented in [17]. Grounding of dependency tree relations is discussed in [19] and reformulated using recurrent nets in [18]. Subject-Verb-Object phrases are considered in [39] to develop a visual knowledge extraction system. Their algorithm reasons about the spatial consistency of the con\ufb01gurations of the involved entities. In [15, 29] caption generation techniques are used to score a set of proposal boxes and to return the highest ranking one. To avoid application of a text generation pipeline on bounding box proposals, [38] improve the phrase encoding using a long short-term memory (LSTM) [12] based deep net. Additional modeling of object-context relationships was explored in [32, 14]. Video datasets, although not directly related to our work in this paper, were used for spatiotemporal language grounding in [27, 45].\nCommon datasets for visual grounding are the ReferItGame dataset [20] and the newly introduced Flickr 30k Entities dataset [35], which provides bounding box annotations for noun phrases of the original Flickr 30k dataset [44].\nIn contrast to all of the aforementioned methods, which are largely based on region proposals, we suggest the use of ef\ufb01cient subwindow search as a suitable inference engine.\nEf\ufb01cient subwindow search: Ef\ufb01cient subwindow search was proposed by Lampert et al. [25] for object localization. It is based on an extremely effective branch and bound scheme that can be applied to a large class of energy functions. The approach has been applied to very ef\ufb01cient deformable part models [43], for object class detection [26], for weakly supervised localization [5], indoor scene understanding [40], diverse object proposals [41] and also for spatio-temporal object detection proposals [33].\n\n\fFigure 2: Overview of our proposed approach: We obtain word priors from the input query, take into account geometric features, as well as semantic segmentation features computed from the provided input image. We compute the three image cues to predict the four variables of the bounding box y = (y1, . . . , y4).\n\n3 Exact Inference for Grounding\n\nWe outline our approach for textual grounding in Fig. 2. In contrast to the aforementioned techniques for textual grounding, which typically use a small set of bounding box proposals, we formulate our language grounding approach as an energy minimization over a large number of bounding boxes. The search over a large number of bounding boxes allows us to retrieve an accurate bounding-box prediction for a given phrase and an image. Importantly, by leveraging ef\ufb01cient branch-and-bound techniques, we are able to \ufb01nd the global minimizer for a given energy function very effectively.\nOur energy is based on a set of \u2018image concepts\u2019 like semantic segmentations, detections or image priors. 
All those concepts come in the form of score maps which we combine linearly before searching for the bounding box containing the highest accumulated score over the combined score map. It is trivial to add additional information to our approach by adding additional score maps. Moreover, the linear combination of score maps reveals the importance of individual score maps for speci\ufb01c queries as well as the similarity between queries such as \u2018skier\u2019 and \u2018snowboarder.\u2019 Hence the framework that we discuss in the following is easy to interpret and extend to other settings.\nGeneral problem formulation: For simplicity we use x to refer to both given input data modalities, i.e., x = (Q, I), with query text, Q, and image, I. We will differentiate them in the narrative. In addition, we de\ufb01ne a bounding box y via its top left corner (y1, y2) and its bottom right corner (y3, y4), and subsume the four variables of interest in the tuple y = (y1, . . . , y4) \u2208 Y = \u220f_{i=1}^{4} {0, . . . , yi,max}. Every integral coordinate yi, i \u2208 {1, . . . , 4} lies within the set {0, . . . , yi,max}, and Y denotes the product space of all four coordinates.\n\n\fAlgorithm 1 Branch and bound inference for grounding\n1: put pair ( \u00afE(x,Y, w),Y) into queue, set \u02c6Y = Y\n2: repeat\n3: split \u02c6Y = \u02c6Y1 \u222a \u02c6Y2 with \u02c6Y1 \u2229 \u02c6Y2 = \u2205\n4: put pair ( \u00afE(x, \u02c6Y1, w), \u02c6Y1) into queue\n5: put pair ( \u00afE(x, \u02c6Y2, w), \u02c6Y2) into queue\n6: retrieve \u02c6Y having smallest \u00afE\n7: until | \u02c6Y| = 1\n\nFigure 3: Word priors (for the words \u2018left,\u2019 \u2018center,\u2019 \u2018right,\u2019 and \u2018\ufb02oor\u2019) in (a) and the employed inference algorithm in (b).\n\n
For notational simplicity only, we assume all images to be scaled to identical dimensions, i.e., yi,max is not dependent on the input data x. We obtain a bounding box prediction \u02c6y given our data x, by solving the energy minimization\n\n\u02c6y = arg min_{y \u2208 Y} E(x, y, w),   (1)\n\nto global optimality. Note that w refers to the parameters of our model. Despite the fact that we are \u2018only\u2019 interested in a single bounding box, the product space Y is generally too large for exhaustive minimization of the energy speci\ufb01ed in Eq. (1). Therefore, we pursue a branch-and-bound technique in the following.\nTo apply branch and bound, we assume that the energy function E(x, y, w) depends on two sets of parameters w = [wt^T, wr^T]^T, i.e., the top layer parameters wt of a neural net, and the remaining parameters wr. In light of this decomposition, our approach requires the energy function to be of the following form:\n\nE(x, y, w) = wt^T \u03c6(x, y, wr).\n\nNote that the features \u03c6(x, y, wr) may still depend non-linearly on all but the top-layer parameters. This assumption does not pose a severe restriction since almost all present-day deep net models obtain the logits E(x, y, w) using a fully-connected layer or a convolutional layer with kernel size 1 \u00d7 1 as the last computation.\nEnergy Function Details: Our energy function E(x, y, w) is based on a set of \u2018image concepts,\u2019 such as semantic segmentation of object categories, detections, or word priors, all of which we subsume in the set C. Importantly, every image concept c \u2208 C is associated with a parametric score map \u02c6\u03c6c(x, wr) \u2208 R^{W\u00d7H} matching the image width W and height H. Note that those parametric score maps may depend nonlinearly on some parameters wr. 
Given a bounding box y, we use the scalar \u03c6c(x, y, wr) \u2208 R to refer to the score accumulated within the bounding box y of score map \u02c6\u03c6c(x, wr).\nTo de\ufb01ne the energy function we also introduce a set of words of interest, i.e., S. Note that this set contains a special symbol denoting all other words not of interest for the considered task. We use the given query Q, which is part of the data x, to construct indicators, \u03b9s = \u03b4(s \u2208 Q) \u2208 {0, 1}, denoting for every token s \u2208 S its existence in the query Q, where \u03b4 denotes the indicator function.\nBased on this de\ufb01nition, we formulate the energy function as follows:\n\nE(x, y, w) = \u2211_{s \u2208 S : \u03b9s = 1} \u2211_{c \u2208 C} ws,c \u03c6c(x, y, wr),   (2)\n\nwhere ws,c is a parameter connecting a word s \u2208 S to an image concept c \u2208 C. In other words, wt = (ws,c : \u2200s \u2208 S, c \u2208 C). This energy function results in a sparse wt, which increases the speed of inference.\nScore maps: The energy is given by a linear combination of accumulated score maps \u03c6c(x, y, wr). In our case, we use |C| = k1 + k2 + k3 of those maps, which capture three kinds of information: (i) k1 word-priors; (ii) k2 geometric information cues; and (iii) k3 image based segmentations and detections. 
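For concreteness, the energy of Eq. (2) can be sketched as follows (an illustrative example, not the authors' implementation; the array shapes, dictionary layout and toy values are assumptions):

```python
import numpy as np

def box_score(score_map, box):
    """Accumulate a score map phi_hat_c inside box = (y1, y2, y3, y4),
    with (y1, y2) the top left and (y3, y4) the bottom right corner."""
    y1, y2, y3, y4 = box
    return float(score_map[y1:y3 + 1, y2:y4 + 1].sum())

def energy(query_words, box, score_maps, w):
    """Eq. (2): sum over query words s (indicator iota_s = 1) and
    concepts c of w[s, c] * phi_c(x, y, w_r)."""
    E = 0.0
    for s in set(query_words):
        for c, score_map in score_maps.items():
            E += w.get((s, c), 0.0) * box_score(score_map, box)
    return E
```

Word/concept pairs without a trained weight contribute nothing, mirroring the sparsity of wt noted above.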
We discuss each of those maps in the following.\n\n\fApproach                            Accuracy (%)\nSCRC (2016) [15]                        27.80\nDSPE (2016) [42]                        43.89\nGroundeR (2016) [38]                    47.81\nCCA (2017) [36]                         50.89\nOurs (Prior + Geo + Seg + Det)          51.63\nOurs (Prior + Geo + Seg + bDet)         53.97\nTable 1: Phrase localization performance on Flickr 30k Entities.\n\nApproach                            Accuracy (%)\nSCRC (2016) [15]                        17.93\nGroundeR (2016) [38]                    23.44\nGroundeR (2016) [38] + SPAT             26.93\nOurs (Prior + Geo)                      25.56\nOurs (Prior + Geo + Seg)                33.36\nOurs (Prior + Geo + Seg + Det)          34.70\nTable 2: Phrase localization performance on ReferItGame.\n\n                      people  clothing  body parts  animals  vehicles  instruments  scene  other\n# Instances            5,656    2,306       523        518      400        162      1,619  3,374\nGroundeR (2016) [38]   61.00    38.12     10.33      62.55    68.75      36.42      58.18  29.08\nCCA (2017) [36]        64.73    46.88     17.21      65.83    68.75      37.65      51.39  31.77\nOurs                   68.71    46.83     19.50      70.07    73.75      39.50      60.38  32.45\nTable 3: Phrase localization performance over types on Flickr 30k Entities (accuracy in %).\n\nFor the top k1 words in the training set we construct word prior maps like the ones shown in Fig. 3 (a). To obtain the prior for a particular word, we search a given training set for each occurrence of the word. With the corresponding subset of image-text pairs and respective bounding box annotations at hand, we compute the average number of times a pixel is covered by a bounding box. To facilitate this operation, we scale each image to a predetermined size. Investigating the obtained word priors given in Fig. 
3 (a) more carefully, it is immediately apparent that they provide accurate location\ninformation for many of the words.\nThe k2 = 2 geometric cues provide the aspect ratio and the area of the hypothesized bounding box y.\nNote that the word priors and geometry features contain no information about the image speci\ufb01cs.\nTo encode measurements dedicated to the image at hand, we take advantage of semantic segmentation\nand object detection techniques. The k3 image based features are computed using deep neural nets as\nproposed by [4, 37, 2]. We obtain probability maps for a set of class categories, i.e., a subset of the\nnouns of interest. The feature \u03c6 accumulates the scores within the hypothesized bounding box y.\nInference: The algorithm to \ufb01nd the bounding box \u02c6y with lowest energy as speci\ufb01ed in Eq. (1) is\nbased on an iterative decomposition of the output space Y [25], summarized in Fig. 3 (b). To this end\nwe search across subsets of the product space Y and we de\ufb01ne for every coordinate yi, i \u2208 {1, . . . , 4}\na corresponding lower and upper bound, yi,low and yi,high respectively. More speci\ufb01cally, considering\nthe initial set of all possible bounding boxes Y, we divide it into two disjoint subsets \u02c6Y1 and \u02c6Y2. For\nexample, by constraining y1 to {0, . . . , y1,max/2} and {y1,max/2 + 1, . . . , y1,max} for \u02c6Y1 and \u02c6Y2\nrespectively, while keeping all the other intervals unchanged. It is easy to see that we can repeat this\ndecomposition by choosing the largest among the four intervals and recursively dividing it into two\nparts.\nGiven such a repetitive decomposition strategy for the output space, and since the energy E(x, y, w)\nfor a bounding box y is obtained using a linear combination of word priors and accumulated\nsegmentation masks, we can design an ef\ufb01cient branch and bound based search algorithm to exactly\nsolve the inference problem speci\ufb01ed in Eq. (1). 
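The following is a minimal, self-contained sketch of such a branch and bound search (an illustration, not the paper's C++ implementation; the function names are ours, and for simplicity it assumes the per-query weighted score maps have already been combined into a single non-positive map m, so that accumulating m over the largest box of an interval yields a valid lower bound — the paper additionally splits maps by sign and uses integral images for constant-time box sums):

```python
import heapq
import numpy as np

def box_sum(m, box):
    """Accumulated score of m inside box = (y1, y2, y3, y4); empty boxes score 0."""
    y1, y2, y3, y4 = box
    if y3 < y1 or y4 < y2:
        return 0.0
    return float(m[y1:y3 + 1, y2:y4 + 1].sum())

def lower_bound(m, iv):
    """For a non-positive map m, the largest box contained in the interval
    iv = ((y1_lo, y1_hi), ..., (y4_lo, y4_hi)) lower-bounds every box in it."""
    (l1, _), (l2, _), (_, h3), (_, h4) = iv
    return box_sum(m, (l1, l2, h3, h4))

def branch_and_bound(m):
    """Best-first search over coordinate intervals, in the spirit of Alg. 1."""
    n_rows, n_cols = m.shape
    root = ((0, n_rows - 1), (0, n_cols - 1), (0, n_rows - 1), (0, n_cols - 1))
    queue = [(lower_bound(m, root), root)]
    while True:
        bound, iv = heapq.heappop(queue)          # retrieve smallest lower bound
        sizes = [hi - lo for lo, hi in iv]
        if max(sizes) == 0:                       # a single box is left
            return tuple(lo for lo, _ in iv), bound
        i = sizes.index(max(sizes))               # split the widest interval
        lo, hi = iv[i]
        mid = (lo + hi) // 2
        for half in ((lo, mid), (mid + 1, hi)):
            child = iv[:i] + (half,) + iv[i + 1:]
            heapq.heappush(queue, (lower_bound(m, child), child))
```

Because the bound is valid for every box in an interval and exact for singletons, the first singleton popped from the queue is a global minimizer.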
The algorithm proceeds by iteratively decomposing a product space \u02c6Y into two subspaces \u02c6Y1 and \u02c6Y2. For each subspace, the algorithm computes a lower bound \u00afE(x, \u02c6Yj, w) for the energy of all possible bounding boxes within the respective subspace. Intuitively, we then know that any bounding box within the subspace \u02c6Yj has an energy no smaller than the lower bound. The algorithm proceeds by choosing the subspace with the lowest lower bound until this subspace consists of a single element, i.e., until | \u02c6Y| = 1. We summarize this algorithm in Alg. 1 (Fig. 3 (b)).\nTo this end, it remains to show how to compute a lower bound \u00afE(x, \u02c6Yj, w) on the energy for an output space, and to illustrate the conditions which guarantee convergence to the global minimum of the energy function.\nFor the latter, we note that two conditions are required to ensure convergence to the optimum: (i) the bound of the considered product space has to lower-bound the true energy for each of its bounding box hypotheses \u02c6y \u2208 \u02c6Y, i.e., \u2200\u02c6y \u2208 \u02c6Y, \u00afE(x, \u02c6Y, w) \u2264 E(x, \u02c6y, w); (ii) the bound has to be exact for all possible bounding boxes y \u2208 Y, i.e., \u00afE(x, y, w) = E(x, y, w).\n\n\fThe lady in the red car is crossing the bridge.\n\nA dog and a cow play together inside the fence.\n\nA woman wearing the black sunglasses and blue jean jacket is smiling.\n\nperson on the left\n\nblack bottle front\n\n\ufb02oor on the bottom\n\nFigure 4: Results on the test set for grounding of textual phrases using our branch and bound based algorithm. Top Row: Flickr 30k Entities Dataset. Bottom Row: ReferItGame Dataset (Groundtruth box in green and predicted box in red).\n\nGiven those two conditions, global convergence of the algorithm summarized in Alg. 
1 is apparent: upon termination we obtain an\n\u2018interval\u2019 containing a single bounding box, and its energy is at least as low as the one for any other\ninterval.\nFor the former, we note that bounds on score maps for bounding box intervals can be computed by\nconsidering either the largest or the smallest possible bounding box in the bounding box hypothesis,\n\u02c6Y, depending on whether the corresponding weight in wt is positive or negative and whether the\nfeature maps contain only positive or negative values. Intuitively, if the weight is positive and\nthe feature mask contains only positive values, we obtain the smallest lower bound \u00afE(x, \u02c6Y, w) by\nconsidering the content within the smallest possible bounding box. Note that the score maps do not\nnecessarily contain only positive or negative numbers. However we can split the given score maps\ninto two separate score maps (i.e., one with only positive values, and another with only negative\nvalues) while applying the same weight.\nIt is important to note that computation of the bound \u00afE(x, \u02c6Y, w) has to be extremely effective for the\nalgorithm to run at a reasonable speed. However, computing the feature mask content for a bounding\nbox is trivially possible using integral images. This results in a constant time evaluation of the bound,\nwhich is a necessity for the success of the branch and bound procedure.\nLearning the Parameters: With the branch and bound based inference procedure at hand, we\nnow describe how to formulate the learning task. Support-vector machine intuition can be applied.\nFormally, we are given a training set D = {(x, y)} containing pairs of input data x and groundtruth\nbounding boxes y. We want to \ufb01nd the parameters w of the energy function E(x, y, w) such that\nthe energy of the groundtruth is smaller than the energy of any other con\ufb01guration. 
Negating this statement results in the following desiderata when including an additional margin term L(y, \u02c6y), also known as task-loss, which measures the loss between the groundtruth y and another con\ufb01guration \u02c6y:\n\n\u2212E(x, y, w) \u2265 \u2212E(x, \u02c6y, w) + L(\u02c6y, y)   \u2200\u02c6y \u2208 Y.\n\nSince we want to enforce this inequality for all con\ufb01gurations \u02c6y \u2208 Y, we can reduce the number of constraints by enforcing it for the highest scoring right hand side. We then design a cost function which penalizes violation of this requirement linearly. We obtain the following structured support vector machine based surrogate loss minimization:\n\nmin_w  (C/2) \u2016w\u2016_2^2 + \u2211_{(x,y) \u2208 D} [ max_{\u02c6y \u2208 Y} ( \u2212E(x, \u02c6y, w) + L(\u02c6y, y) ) + E(x, y, w) ],   (3)\n\nwhere C is a hyperparameter adjusting the squared norm regularization to the data term. For the task loss L(\u02c6y, y) we use intersection over union (IoU).\n\n\fher shoes\n\na red shirt\n\na dirt bike\n\nFigure 5: Flickr 30k Failure Cases. (Green box: ground-truth, Red box: predicted)\n\nBy \ufb01xing the parameters wr and only learning the top layer parameters wt, Eq. (3) is equivalent to the problem of training a structured SVM. We found the cutting-plane algorithm [16] to work well in our context. The cutting-plane algorithm involves solving the maximization task. This maximization over the output space Y is commonly referred to as loss-augmented inference. Loss-augmented inference is structurally similar to the inference task given in Eq. (1). Since maximization is identical to negated minimization, the computation of the bounds for the energy E(x, \u02c6y, w) remains identical. To bound the IoU loss, we note that a quotient can be bounded by bounding numerator and denominator independently. 
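This bounding recipe can be sketched as follows (an illustration under the assumption that the task loss takes the common form L(\u02c6y, y) = 1 \u2212 IoU(\u02c6y, y); boxes are tuples (y1, y2, y3, y4) of top-left and bottom-right corners, and the helper names are ours):

```python
def area(box):
    """Pixel area of a box (y1, y2, y3, y4); empty boxes have area 0."""
    y1, y2, y3, y4 = box
    return max(0, y3 - y1 + 1) * max(0, y4 - y2 + 1)

def intersection_area(a, b):
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def loss_upper_bound(gt, iv):
    """Upper bound on L = 1 - IoU(gt, y) over all boxes y in the interval
    iv = ((y1_lo, y1_hi), ..., (y4_lo, y4_hi)): the intersection is
    lower-bounded via the smallest contained box, and the union
    upper-bounded via the largest contained box."""
    (l1, h1), (l2, h2), (l3, h3), (l4, h4) = iv
    smallest = (h1, h2, l3, l4)   # latest top-left, earliest bottom-right
    largest = (l1, l2, h3, h4)
    inter_lo = intersection_area(gt, smallest)
    union_hi = area(gt) + area(largest) - intersection_area(gt, largest)
    return 1.0 - inter_lo / union_hi
```

The bound is exact once the interval shrinks to a single box, which is the condition branch and bound needs.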
To lower bound the intersection of the groundtruth box with the hypothesis space we\nuse the smallest hypothesized bounding box. To upper bound the union of the groundtruth box with\nthe hypothesis space we use the largest bounding box.\nFurther, even though not employed to obtain the results in this paper, we mention that it is possible\nto backpropagate through the neural net parameters wr that in\ufb02uence the energy non-linearly. This\nunderlines that our initial assumption is merely a construct to design an effective inference procedure.\n\n4 Experimental Evaluation\n\nIn the following we \ufb01rst provide additional details of our implementation before discussing the results\nof our approach.\nLanguage processing: In order to process free-form textual phrases ef\ufb01ciently, we restricted the\nvocabulary size to the top 200 most frequent words in the training set for the ReferItGame, and to the\ntop 1000 most frequent training set words for Flickr 30k Entities; both choices cover about 90% of\nall phrases in the training set. We map all the remaining words into an additional token. We don\u2019t\ndifferentiate between uppercase and lower case characters and we also ignore punctuation.\nSegmentation and detection maps: We employ semantic segmentation, object detection, and pose-\nestimation. For segmentation, we use the DeepLab system [4], trained on PASCAL VOC-2012 [8]\nsemantic image segmentation task, to extract the probability maps for 21 categories. For detection,\nwe use the YOLO object detection system [37], to extract 101 categories, 21 trained on PASCAL\nVOC-2012, and 80 trained on MSCOCO [28]. 
For pose estimation, we use the system from [2] to extract body part locations, then post-process to get the head, upper body, lower body, and hand regions.\nFor the ReferItGame, we further \ufb01ne-tuned the last layer of the DeepLab system to include the categories of \u2018sky,\u2019 \u2018ground,\u2019 \u2018building,\u2019 \u2018water,\u2019 \u2018tree,\u2019 and \u2018grass.\u2019 For the Flickr 30k Entities, we also \ufb01ne-tuned the last layer of the DeepLab system using the eight coarse-grained types and eleven colors from [36].\nPreprocessing and post-processing: For word prior feature maps and the semantic segmentation maps, we take an element-wise logarithm to convert the normalized feature counts into log-probabilities. The summation over a bounding box region then retains the notion of a joint log-probability. We also centered the feature maps to be zero-mean, which corresponds to choosing an initial decision threshold. The feature maps are resized to a dimension of 64 \u00d7 64 for ef\ufb01cient computation, and the predicted box is scaled back to the original image dimension during evaluation. We re-center the prediction box by a constant amount determined using the validation set, as resizing truncates box coordinates to integers.\nEf\ufb01cient sub-window search implementation: In order for the ef\ufb01cient subwindow search to run at a reasonable speed, the lower bound on E needs to be computed as fast as possible. Observe that E(x, y, w) is a weighted sum of the feature maps over the region speci\ufb01ed by a hypothesized bounding box. To make this computation ef\ufb01cient, we pre-compute integral images. Given an integral image, the computation for each bounding box is simply a look-up operation. This trick can similarly be applied for the geometric features. Since we know the range of the ratios and areas of the bounding boxes ahead of time, we cache the results in a look-up table as well.\n\n\fFigure 6: (a) Trained weights, ws,c, visualized for words, s, and segmentation concepts, c, on Flickr 30k. (b) Cosine similarity between word vectors ws and ws\u2032 on Flickr 30k.\n\nThe ReferItGame dataset consists of more than 99,000 regions from 20,000 images. Bounding boxes are assigned to natural language expressions. We use the same bounding boxes as [38] and the same training/test set split, i.e., 10,000 images for testing, 9,000 images for training and 1,000 images for validation.\nThe Flickr 30k Entities dataset consists of more than 275k bounding boxes from 31k images, where each bounding box is annotated with the corresponding natural language phrase. We use the same training, validation and testing split as in [35].\nQuantitative evaluation: In Tab. 1 and Tab. 2 we quantitatively compare the results of our approach to recent state-of-the-art baselines, where Prior = word priors, Geo = geometric information, Seg = segmentation maps, Det = detection maps, bDet = detection maps + body parts detection. An example is considered correct if the predicted box overlaps with the ground-truth box by more than 0.5 IoU. We observe our approach to outperform competing methods by around 3% on the Flickr 30k Entities dataset and by around 7% on the ReferItGame dataset.\nWe also provide an ablation study of the word and image information as shown in Tab. 1 and Tab. 2. In Tab. 3 we analyze the results for each \u201cphrase type\u201d provided by the Flickr 30k Entities dataset. 
As can be seen, our system outperforms the state-of-the-art in all phrase types except for clothing.\nWe note that our results have been surpassed by [3, 7, 34], where the entire network, including the feature extraction, was \ufb01ne-tuned, or more feature detectors were trained; CCA, GroundeR and our approach use a \ufb01xed pre-trained network for extracting image features.\nQualitative evaluation: Next we evaluate our approach qualitatively. In Fig. 1 and Fig. 4 we show success cases. We observe that our method successfully captures a variety of objects and scenes. In Fig. 5 we illustrate failure cases. We observe that in a few cases the word prior may hurt the prediction (e.g., shoes are typically in the bottom half of the image). Also, our system may fail when the energy is not a linear combination of the feature scores. For example, the score of \u201cdirt bike\u201d should not be the score of \u201cdirt\u201d + the score of \u201cbike.\u201d We provide additional results in the supplementary material.\nLearned parameters + word embedding: Recall that in Eq. (2) our model learns a parameter per phrase word and concept pair, ws,c. We visualize its magnitude in Fig. 6 (a) for a subset of words and concepts. As can be seen, ws,c is large when the phrase word and the concept are related (e.g., s = ship and c = boat). This demonstrates that our model successfully learns the relationship between phrase words and image concepts. This also means that the \u201cword vector,\u201d ws = [ws,1, ws,2, . . . , ws,|C|], can be interpreted as a word embedding. Therefore, in Fig. 6 (b), we visualize the cosine similarity between pairs of word vectors. Expected groups of words form, for example (bicycle, bike), (camera, cellphone), (coffee, cup, drink), (man, woman), (snowboarder, skier). 
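The comparison visualized in Fig. 6 (b) amounts to cosine similarity between rows of the learned weight matrix; a minimal sketch with toy (assumed, not learned) weights:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors w_s = [w_{s,1}, ..., w_{s,|C|}]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy weight matrix: one row per query word, one column per image concept.
# The values are illustrative only, not parameters from the paper.
W = np.array([[2.0, 0.0, 0.0],   # 'skier'       -> strong weight on one concept
              [2.0, 0.0, 0.0],   # 'snowboarder' -> same concept profile
              [0.0, 3.0, 1.0]])  # 'bottle'      -> unrelated concepts
```

Rows with the same concept profile ('skier', 'snowboarder') have similarity 1, while rows with disjoint profiles have similarity 0 — the block structure visible in Fig. 6 (b).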
The word vectors capture the image-spatial relationship of the words, meaning items that can be \u201creplaced\u201d in an image are similar (e.g., a \u201csnowboarder\u201d can be replaced with a \u201cskier\u201d and the overall image would still be reasonable).\nComputational Ef\ufb01ciency: Overall, our method\u2019s inference speed is comparable to CCA and much faster than GroundeR. The inference speed can be divided into three main parts: (1) extracting image features, (2) extracting language features, and (3) computing scores. For extracting image features, GroundeR requires a forward pass on VGG16 for each image region, whereas CCA and our approach require a single forward pass, which can be done in 142.85 ms. For extracting language features, our method requires index lookups, which take a negligible amount of time (less than 1e-6 ms). CCA uses Word2vec for processing the text, which takes 0.070 ms. GroundeR uses a long short-term memory (LSTM) net, which takes 0.7457 ms. Computing the scores with our C++ implementation takes 1.05 ms on a CPU. CCA needs to compare projections of the text and image features, which takes 13.41 ms on a GPU and 609 ms on a CPU. GroundeR uses a single fully connected layer, which takes 0.31 ms on a GPU.\n\n5 Conclusion\n\nWe demonstrated a mechanism for grounding of textual phrases which provides interpretability, is easy to extend, and permits globally optimal inference. In contrast to existing approaches which are generally based on a small set of bounding box proposals, we ef\ufb01ciently search over all possible bounding boxes. 
We think interpretability, i.e., linking of word and image concepts, is an important property,
particularly for textual grounding, and it deserves more attention.

Acknowledgments: This material is based upon work supported in part by the National Science
Foundation under Grant No. 1718221. This work is supported by NVIDIA Corporation with the
donation of a GPU. This work is supported in part by the IBM-ILLINOIS Center for Cognitive Computing
Systems Research (C3SR) - a research collaboration as part of the IBM Cognitive Horizons Network.

References
[1] R. Arandjelovic and A. Zisserman. Multiple queries for large scale specific object retrieval. In Proc. BMVC, 2012.
[2] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proc. CVPR, 2017.
[3] K. Chen∗, R. Kovvuri∗, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In Proc. ICCV, 2017. (∗equal contribution).
[4] L.-C. Chen∗, G. Papandreou∗, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proc. ICLR, 2015. (∗equal contribution).
[5] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. IJCV, 2012.
[6] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proc. CVPR, 2015.
[7] K. Endo, M. Aono, E. Nichols, and K. Funakoshi. An attention-based regression model for grounding textual phrases in images. In Proc. IJCAI, 2017.
[8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
[9] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov. Devise: A deep visual-semantic embedding model. In Proc. NIPS, 2013.
[10] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In Proc. ECCV, 2014.
[11] S. Guadarrama, E. Rodner, K. Saenko, N. Zhang, R. Farrell, J. Donahue, and T. Darrell. Open-vocabulary object retrieval. In Proc. RSS, 2014.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[13] S. C. Hoi, W. Liu, M. R. Lyu, and W.-Y. Ma. Learning distance metrics with contextual constraints for image retrieval. In Proc. CVPR, 2006.
[14] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proc. CVPR, 2017.
[15] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Proc. CVPR, 2016.
[16] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural svms. Machine Learning, 2009.
[17] J. Johnson, R. Krishna, M. Stark, L. J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In Proc. CVPR, 2015.
[18] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proc. CVPR, 2015.
[19] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Proc. NIPS, 2014.
[20] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proc. EMNLP, 2014.
[21] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015.
[22] B. Klein, G. Lev, G. Sadeh, and L. Wolf. Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399, 2014.
[23] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? text-to-image coreference. In Proc. CVPR, 2014.
[24] J. Krishnamurthy and T. Kollar. Jointly learning to parse and perceive: connecting natural language to the physical world. TACL, 2013.
[25] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. PAMI, 2009.
[26] A. Lehmann, B. Leibe, and L. Van Gool. Fast PRISM: Branch and bound Hough transform for object class detection. IJCV, 2011.
[27] D. Lin, S. Fidler, C. Kong, and R. Urtasun. Visual semantic search: Retrieving videos via complex textual queries. In Proc. CVPR, 2014.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proc. ECCV, 2014.
[29] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proc. CVPR, 2016.
[30] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). In Proc. ICLR, 2015.
[31] C. Matuszek, N. Fitzgerald, L. Zettlemoyer, L. Bo, and D. Fox. A joint model of language and perception for grounded attribute learning. In Proc. ICML, 2012.
[32] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In Proc. ECCV, 2016.
[33] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection proposals. In Proc. ECCV, 2014.
[34] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In Proc. ICCV, 2017.
[35] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proc. ICCV, 2015.
[36] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV, 2017.
[37] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In Proc. CVPR, 2017.
[38] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In Proc. ECCV, 2016.
[39] F. Sadeghi, S. K. Divvala, and A. Farhadi. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases. In Proc. CVPR, 2015.
[40] A. G. Schwing and R. Urtasun. Efficient exact inference for 3d indoor scene understanding. In Proc. ECCV, 2012.
[41] Q. Sun and D. Batra. Submodboxes: Near-optimal search for a set of diverse object proposals. In Proc. NIPS, 2015.
[42] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In Proc. CVPR, 2016.
[43] J. Yan, Z. Lei, L. Wen, and S. Z. Li. The fastest deformable part model for object detection. In Proc. CVPR, 2014.
[44] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
[45] H. Yu and J. M. Siskind. Grounded language learning from video described with sentences. In Proc. ACL, 2013.