{"title": "Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos", "book": "Advances in Neural Information Processing Systems", "page_first": 536, "page_last": 546, "abstract": "Temporal sentence grounding in videos aims to detect and localize one target video segment, which semantically corresponds to a given sentence. Existing methods mainly tackle this task via matching and aligning semantics between a sentence and candidate video segments, while neglect the fact that the sentence information plays an important role in temporally correlating and composing the described contents in videos. In this paper, we propose a novel semantic conditioned dynamic modulation (SCDM) mechanism, which relies on the sentence semantics to modulate the temporal convolution operations for better correlating and composing the sentence related video contents over time. More importantly, the proposed SCDM performs dynamically with respect to the diverse video contents so as to establish a more precise matching relationship between sentence and video, thereby improving the temporal grounding accuracy. Extensive experiments on three public datasets demonstrate that our proposed model outperforms the state-of-the-arts with clear margins, illustrating the ability of SCDM to better associate and localize relevant video contents for temporal sentence grounding. 
Our code for this paper is available at https://github.com/yytzsy/SCDM.", "full_text": "Semantic Conditioned Dynamic Modulation\nfor Temporal Sentence Grounding in Videos\n\nYitian Yuan\u2217\n\nTsinghua-Berkeley Shenzhen Institute\n\nTsinghua University\n\nyyt18@mails.tsinghua.edu.cn\n\nLin Ma\n\nTencent AI Lab\n\nforest.linma@gmail.com\n\nJingwen Wang\nTencent AI Lab\n\njaywongjaywong@gmail.com\n\nWei Liu\n\nTencent AI Lab\n\nWenwu Zhu\n\nTsinghua University\n\nwl2223@columbia.edu\n\nwwzhu@tsinghua.edu.cn\n\nAbstract\n\nTemporal sentence grounding in videos aims to detect and localize one target video\nsegment, which semantically corresponds to a given sentence. Existing methods\nmainly tackle this task via matching and aligning semantics between a sentence and\ncandidate video segments, while neglect the fact that the sentence information plays\nan important role in temporally correlating and composing the described contents in\nvideos. In this paper, we propose a novel semantic conditioned dynamic modulation\n(SCDM) mechanism, which relies on the sentence semantics to modulate the\ntemporal convolution operations for better correlating and composing the sentence-\nrelated video contents over time. More importantly, the proposed SCDM performs\ndynamically with respect to the diverse video contents so as to establish a more\nprecise matching relationship between sentence and video, thereby improving\nthe temporal grounding accuracy. Extensive experiments on three public datasets\ndemonstrate that our proposed model outperforms the state-of-the-arts with clear\nmargins, illustrating the ability of SCDM to better associate and localize relevant\nvideo contents for temporal sentence grounding. Our code for this paper is available\nat https://github.com/yytzsy/SCDM .\n\n1\n\nIntroduction\n\nDetecting or localizing activities in videos [18, 30, 26, 34, 11, 29, 21, 9, 8] is a prominent while\nfundamental problem for video understanding. 
As videos often contain intricate activities that cannot be indicated by a predefined list of action classes, a new task, namely temporal sentence grounding in videos (TSG) [14, 10], has recently attracted much research attention [2, 36, 3, 19, 4, 5, 35]. Formally, given an untrimmed video and a natural sentence query, the task aims to identify the start and end timestamps of one specific video segment, which contains activities of interest semantically corresponding to the given sentence query.

Most existing approaches [10, 14, 19, 4] for the TSG task first sample candidate video segments, then fuse the sentence and video segment representations together, and thereby evaluate their matching relationships based on the fused features. Lately, some approaches [2, 36] try to directly fuse the sentence information with each video clip, then employ an LSTM or a ConvNet to compose the fused features over time, and thus predict the temporal boundaries of the target video segment. While promising results have been achieved, several problems still need to be addressed.

∗This work was done while Yitian Yuan was a Research Intern at Tencent AI Lab.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: The temporal sentence grounding in videos (TSG) task. Our proposed SCDM relies on the sentence to modulate the temporal convolution operations, which can thereby temporally correlate and compose the various sentence-related activities (highlighted in red and green) for more accurate grounding results.

First, previous methods mainly focus on semantically matching sentences and individual video segments or clips, while neglecting the important guiding role of sentences in correlating and composing video contents over time. 
For example, the target video sequence shown in Figure 1 mainly expresses two distinct activities, "woman walks across the room" and "woman reads the book on the sofa". Without referring to the sentence, these two distinct activities are not easily associated as one whole event. However, the sentence clearly indicates that "The woman takes the book across the room to read it on the sofa". Keeping such a semantic meaning in mind, one can easily correlate the two activities and thereby precisely determine the temporal boundaries. Therefore, how to use the sentence semantics to guide the composing and correlating of relevant video contents over time is crucial for the TSG task. Second, activities contained in videos usually have diverse visual appearances and present in various temporal scales. Therefore, the sentence guidance for composing and correlating video contents should also be considered at different temporal granularities and dynamically evolve with the diverse visual appearances.

In this paper, we propose a novel semantic conditioned dynamic modulation (SCDM) mechanism, which leverages sentence semantic information to modulate the temporal convolution processes in a hierarchical temporal convolutional network. The SCDM manipulates the temporal feature maps by adjusting the scaling and shifting parameters for feature normalization with reference to the sentence semantics. As such, the temporal convolution process is activated to better associate and compose sentence-related video contents over time. 
More speci\ufb01cally, such a modulation dynamically evolves\nwhen processing different convolutional layers and different locations of feature maps, so as to\nbetter align the sentence and video semantics under diverse video contents and various granularities.\nCoupling SCDM with the temporal convolutional network, our proposed model naturally characterizes\nthe interaction behaviors between sentence and video, leading to a novel and effective architecture\nfor the TSG task.\nOur main contributions are summarized as follows. (1) We propose a novel semantic conditioned\ndynamic modulation (SCDM) mechanism, which dynamically modulates the temporal convolution\nprocedure by referring to the sentence semantic information. In doing so, the sentence-related\nvideo contents can be temporally correlated and composed to yield a precise temporal boundary\nprediction. (2) Coupling the proposed SCDM with the hierarchical temporal convolutional network,\nour model naturally exploits the complicated semantic interactions between sentence and video in\nvarious temporal granularities. (3) We conduct experiments on three public datasets, and verify\nthe effectiveness of the proposed SCDM mechanism as well as its coupled temporal convolution\narchitecture with the superiority over the state-of-the-art methods.\n\n2 Related Works\n\nTemporal sentence grounding in videos is a new task introduced recently [10, 14]. Some previous\nworks [10, 14, 19, 33, 4, 12] often adopted a two-stage multimodal matching strategy to solve\nthis problem. They sampled candidate segments from a video \ufb01rst, then integrated the sentence\nrepresentation with those video segments individually, and thus evaluated their matching relationships\nthrough the integrated features. With the above multimodal matching framework, Hendricks et al.\n[14] further introduced temporal position features of video segments into the feature fusion procedure;\nGao et al. 
[10] established a location regression network to adjust the temporal position of the candidate segment to the target segment; Liu et al. [19] designed a memory attention mechanism to emphasize the visual features mentioned in the sentence; Xu et al. [33] and Chen et al. [4] proposed to generate query-specific proposals as candidate segments; Ge et al. [12] investigated activity concepts from both videos and queries to enhance the temporal sentence grounding.

Figure 2: An overview of our proposed model for the TSG task, which consists of three fully-coupled components. The multimodal fusion fuses the entire sentence and each video clip in a fine-grained manner. Based on the fused representation, the semantic modulated temporal convolution correlates sentence-related video contents in the temporal convolution procedure, with the proposed SCDM dynamically modulating temporal feature maps with reference to the sentence. Finally, the position prediction outputs the location offsets and overlap scores of candidate video segments based on the modulated features. Best viewed in color.

Recently, some other works [4, 36] proposed to directly integrate sentence information with each fine-grained video clip unit, and then predicted the temporal boundary of the target segment by gradually merging the fusion feature sequence over time in an end-to-end fashion. Specifically, Chen et al. [2] aggregated frame-by-word interactions between video and language through a Match-LSTM [31]. Zhang et al. 
[36] adopted the Graph Convolutional Network (GCN) [16] to model relations among candidate segments produced from a convolutional neural network.

Although promising results have been achieved by existing methods, they all focus on better aligning semantic information between sentence and video, while neglecting the fact that sentence information plays an important role in correlating the described activities in videos. Our work first introduces the sentence information as a critical prior to compose and correlate video contents over time; the sentence-guided video composing is then dynamically performed in a hierarchical temporal convolution architecture, in order to cover the diverse video contents of various temporal granularities.

3 The Proposed Model

Given an untrimmed video V and a sentence query S, the TSG task aims to determine the start and end timestamps of one video segment, which semantically corresponds to the given sentence query. In order to perform the temporal grounding, the video is first represented as V = {v_t}_{t=1}^{T} clip-by-clip, and accordingly the query sentence is represented as S = {s_n}_{n=1}^{N} word-by-word.

In this paper, we propose one novel model to handle the TSG task, as illustrated in Figure 2. Specifically, the proposed model consists of three components, namely the multimodal fusion, the semantic modulated temporal convolution, and the position prediction. Please note that the three components are fully coupled and can therefore be trained in an end-to-end manner.

3.1 Multimodal Fusion

The TSG task requires understanding both the sentence and the video. As such, in order to correlate their corresponding semantic information, we first let each video clip meet and interact with the entire sentence, which is formulated as:

f_t = ReLU(W_f (v_t ‖ s) + b_f),    (1)

where W_f and b_f are the learnable parameters, and ‖ denotes concatenation. s denotes the global sentence representation, which can be obtained by simply averaging the word-level sentence representation S. With such a multimodal fusion strategy, the yielded representation F = {f_t}_{t=1}^{T} ∈ R^{T×d_f} captures the interactions between sentence and video clips in a fine-grained manner. The following semantic modulated temporal convolution will gradually correlate and compose such representations over time, which is expected to help produce accurate temporal boundary predictions of various scales.

3.2 Semantic Modulated Temporal Convolution

As aforementioned, the sentence-described activities in videos may have various durations and scales. Therefore, the fused multimodal representation F should be exploited at different temporal scales to comprehensively characterize the temporal diversity of video activities. Inspired by the efficient single-shot object and action detections [20, 18], a temporal convolutional network with a hierarchical architecture is used to produce multi-scale features to cover the activities of various durations. Moreover, in order to fully exploit the guiding role of the sentence, we propose one novel semantic conditioned dynamic modulation (SCDM) mechanism, which relies on the sentence semantics to modulate the temporal convolution operations for better correlating and composing the sentence-related video contents over time. In the following, we first review the basics of the temporal convolutional network. 
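To make the hierarchy concrete, the following is a minimal NumPy sketch of such a temporal convolutional stack (kernel size 3, stride 2, zero padding 1, followed by ReLU). It is only an illustration, not the authors' released implementation; the sizes follow the Charades-STA configuration reported in Section 4.2 (input length 64, six layers, 512 filters), and the random weights are placeholders.

```python
import numpy as np

def temporal_conv_layer(feats, weights, bias):
    """One temporal convolutional layer Conv(3, 2, d_h): kernel size 3,
    stride 2, zero padding 1, followed by ReLU; halves the temporal dim."""
    T, d_in = feats.shape
    padded = np.concatenate(
        [np.zeros((1, d_in), feats.dtype), feats, np.zeros((1, d_in), feats.dtype)]
    )
    out = []
    for t in range(0, T, 2):                      # stride 2
        window = padded[t:t + 3].reshape(-1)      # kernel covers 3 time steps
        out.append(np.maximum(weights @ window + bias, 0.0))  # ReLU
    return np.stack(out)

rng = np.random.default_rng(0)
T, d_f, d_h = 64, 512, 512                        # Charades-STA setting
A = rng.standard_normal((T, d_f)).astype(np.float32)
dims = [A.shape[0]]
for k in range(6):                                # six stacked layers
    W = 0.01 * rng.standard_normal((d_h, 3 * A.shape[1])).astype(np.float32)
    b = np.zeros(d_h, dtype=np.float32)
    A = temporal_conv_layer(A, W, b)
    dims.append(A.shape[0])
print(dims)  # [64, 32, 16, 8, 4, 2, 1]
```

Each successive feature map halves the temporal dimension, so a unit in a deeper map corresponds to a longer video segment, which is what the multi-scale prediction below relies on.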
Afterwards, the proposed SCDM will be described in detail.

3.2.1 Temporal Convolutional Network

Taking the multimodal fusion representation F as input, the standard temporal convolution operation in this paper is denoted as Conv(θ_k, θ_s, d_h), where θ_k, θ_s, and d_h indicate the kernel size, stride size, and number of filters, respectively. A nonlinear activation, such as ReLU, then follows the convolution operation to construct a basic temporal convolutional layer. By setting θ_k to 3 and θ_s to 2, each convolutional layer will halve the temporal dimension of the input feature map and meanwhile expand the receptive field of each feature unit within the map. By stacking multiple layers, a hierarchical temporal convolutional network is constructed, with each feature unit in one specific feature map corresponding to one specific video segment in the original video. For brevity, we denote the output feature map of the k-th temporal convolutional layer as A_k = {a_{k,i}}_{i=1}^{T_k} ∈ R^{T_k×d_h}, where T_k = T_{k−1}/2 is the temporal dimension, and a_{k,i} ∈ R^{d_h} denotes the i-th feature unit of the k-th layer feature map.

3.2.2 Semantic Conditioned Dynamic Modulation

Regarding video activity localization, besides the video clip contents, their temporal correlations play an even more important role. For the TSG task, the query sentence, presenting rich semantic indications on such important correlations, provides crucial information to temporally associate and compose the consecutive video contents over time. Based on the above considerations, in this paper, we propose a novel SCDM mechanism, which relies on the sentence semantic information to dynamically modulate the feature composition process in each temporal convolutional layer.

Specifically, as shown in Figure 3(b), given the sentence representation S = {s_n}_{n=1}^{N} and one feature map A = {a_i} extracted from one specific temporal convolutional layer (we omit the layer number here), we attentively summarize the sentence representation to c_i with respect to each feature unit a_i:

ρ_i^n = softmax(w^T tanh(W_s s_n + W_a a_i + b)),    c_i = Σ_{n=1}^{N} ρ_i^n s_n,    (2)

where w, W_s, W_a, and b are the learnable parameters. Afterwards, two fully-connected (FC) layers with the tanh activation function are used to generate two modulation vectors γ_i^c ∈ R^{d_h} and β_i^c ∈ R^{d_h}, respectively:

γ_i^c = tanh(W_γ c_i + b_γ),    β_i^c = tanh(W_β c_i + b_β),    (3)

where W_γ, b_γ, W_β, and b_β are the learnable parameters. Finally, based on the generated modulation vectors γ_i^c and β_i^c, the feature unit a_i is modulated as:

â_i = γ_i^c · (a_i − μ(A)) / σ(A) + β_i^c.    (4)

Figure 3: The comparison between conditional normalization and our proposed semantic conditioned dynamic modulation.

With the proposed SCDM, the temporal feature maps yielded during the temporal convolution process are meticulously modulated by scaling and shifting the corresponding normalized features under the sentence guidance. As such, each temporal feature map will absorb the sentence semantic information, and further activate the following temporal convolutional layer to better correlate and compose the sentence-related video contents over time. 
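The modulation of Equations (2)-(4) can be condensed into a short sketch. The NumPy code below is an illustrative re-implementation rather than the released one; in particular, computing μ(A) and σ(A) per channel over the temporal dimension of the feature map is our assumption, and all parameter names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scdm(A, S, p):
    """Semantic conditioned dynamic modulation (sketch of Eqs. (2)-(4)).
    A: (T_k, d_h) temporal feature map, S: (N, d_s) word-level features."""
    mu = A.mean(axis=0)                    # assumed: per-channel statistics
    sigma = A.std(axis=0) + 1e-5           # over the temporal dimension
    out = np.empty_like(A)
    for i, a_i in enumerate(A):
        # Eq. (2): attention over words, conditioned on feature unit a_i
        scores = np.tanh(S @ p["Ws"].T + p["Wa"] @ a_i + p["b"]) @ p["w"]
        c_i = softmax(scores) @ S          # attended sentence summary
        # Eq. (3): unit-specific scaling / shifting vectors
        gamma = np.tanh(p["Wg"] @ c_i + p["bg"])
        beta = np.tanh(p["Wb"] @ c_i + p["bb"])
        # Eq. (4): modulate the normalized feature unit
        out[i] = gamma * (a_i - mu) / sigma + beta
    return out

rng = np.random.default_rng(1)
T_k, N, d_h, d_s, d_a = 4, 5, 8, 8, 6      # toy sizes for illustration
p = {"w": rng.standard_normal(d_a), "b": np.zeros(d_a),
     "Ws": rng.standard_normal((d_a, d_s)), "Wa": rng.standard_normal((d_a, d_h)),
     "Wg": rng.standard_normal((d_h, d_s)), "bg": np.zeros(d_h),
     "Wb": rng.standard_normal((d_h, d_s)), "bb": np.zeros(d_h)}
A_hat = scdm(rng.standard_normal((T_k, d_h)), rng.standard_normal((N, d_s)), p)
print(A_hat.shape)  # (4, 8)
```

Because the attention in Eq. (2) is conditioned on each a_i, the resulting γ and β differ across temporal units, which is exactly what distinguishes SCDM from conditional batch/instance normalization discussed next.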
Coupling the proposed SCDM with each temporal convolutional layer, we thus obtain the novel semantic modulated temporal convolution as shown in the middle part of Figure 2.

Discussion. As shown in Figure 3, our proposed SCDM differs from the existing conditional batch/instance normalization [6, 7], where the same γ^c and β^c are applied within the whole batch/instance. On the contrary, as indicated in Equations (2)-(4), our SCDM dynamically aggregates the meaningful words with reference to different video contents, making the yielded γ^c and β^c dynamically evolve for different temporal units within each specific feature map. Such a dynamic modulation enables each temporal feature unit to interact with each word to collect useful grounding cues along the temporal dimension. Therefore, the sentence-video semantics can be better aligned over time to support more precise boundary predictions. Detailed experimental demonstrations will be given in Section 4.5.

3.3 Position Prediction

Similar to [20, 18] for object/action detections, during the prediction, lower and higher temporal convolutional layers are used to localize short and long activities, respectively. As illustrated in Figure 4, regarding a feature map with temporal dimension T_k, the basic temporal span for each feature unit within this feature map is 1/T_k. We impose different scale ratios based on the basic span, and denote them as r ∈ R = {0.25, 0.5, 0.75, 1.0}. As such, for the i-th feature unit of the feature map, we can compute the length of the scaled spans within it as r/T_k, and the center of these spans is (i + 0.5)/T_k. For the whole feature map, there are a total number of T_k · |R| scaled spans within it, with each span corresponding to a candidate video segment for grounding.

Then, we impose an additional set of convolution operations on the layer-wise temporal feature maps to predict the target video segment position. 
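Before turning to the prediction head, the span enumeration just described can be written in a few lines. This sketches only the geometry (the function name is ours), with the scale ratios R = {0.25, 0.5, 0.75, 1.0} from above and all coordinates normalized to [0, 1]:

```python
def candidate_spans(T_k, ratios=(0.25, 0.5, 0.75, 1.0)):
    """Enumerate the T_k * |R| candidate segments of a feature map with
    temporal dimension T_k, as (center, width) pairs in [0, 1]."""
    spans = []
    for i in range(T_k):
        center = (i + 0.5) / T_k       # center of the i-th feature unit
        for r in ratios:
            spans.append((center, r / T_k))  # scaled span length
    return spans

spans = candidate_spans(4)
print(len(spans))   # 16 = T_k * |R|
print(spans[0])     # (0.125, 0.0625) -> first unit, ratio 0.25
```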
Specifically, each candidate segment will be associated with a prediction vector p = (p^over, Δc, Δw), where p^over is the predicted overlap score between the candidate and ground-truth segment, and Δc and Δw are the temporal center and width offsets of the candidate segment relative to the ground-truth. Suppose that the center and width of a candidate segment are μ^c and μ^w, respectively. The center φ^c and width φ^w of the corresponding predicted segment are then determined by:

φ^c = μ^c + α^c · μ^w · Δc,    φ^w = μ^w · exp(α^w · Δw),    (5)

where α^c and α^w are both used for controlling the effect of location offsets to make location prediction stable, and are set to 0.1 empirically. As such, for a feature map with temporal dimension T_k, we can obtain a predicted segment set Φ_k = {(p_j^over, φ_j^c, φ_j^w)}_{j=1}^{T_k·|R|}. The total predicted segment set is therefore denoted as Φ = {Φ_k}_{k=1}^{K}, where K is the number of temporal feature maps.

Figure 4: The illustration of temporal scale ratios and offsets.

3.4 Training and Inference

Training: Our training sample consists of three elements: an input video, a sentence query, and the ground-truth segment. We treat candidate segments within different temporal feature maps as positive if their tIoUs (temporal Intersection-over-Union) with ground-truth segments are larger than 0.5. Our training objective includes an overlap prediction loss L_over and a location prediction loss L_loc. 
The L_over term is realized as a cross-entropy loss, which is defined as:

L_over = Σ_{z∈{pos,neg}} −(1/N_z) Σ_{i=1}^{N_z} [ g_i^over log(p_i^over) + (1 − g_i^over) log(1 − p_i^over) ],    (6)

where g^over is the ground-truth tIoU between the candidate and target segments, and p^over is the predicted overlap score. The L_loc term measures the Smooth L1 loss [13] for positive samples:

L_loc = (1/N_pos) Σ_{i=1}^{N_pos} [ SL1(g_i^c − φ_i^c) + SL1(g_i^w − φ_i^w) ],    (7)

where g^c and g^w are the center and width of the ground-truth segment, respectively.

The two losses are jointly considered for training our proposed model, with λ and η balancing their contributions:

L_all = λ L_over + η L_loc.    (8)

Inference: The predicted segment set Φ of different temporal granularities can be generated in one forward pass. All the predicted segments within Φ are ranked and refined with non-maximum suppression (NMS) according to the predicted p^over scores. Afterwards, the final temporal grounding result is obtained.

4 Experiments

4.1 Datasets and Evaluation Metrics

We validate the performance of our proposed model on three public datasets for the TSG task: TACoS [24], Charades-STA [10], and ActivityNet Captions [17]. The TACoS dataset mainly contains videos depicting human cooking activities, while Charades-STA and ActivityNet Captions focus on more complicated human activities in daily life.

For fair comparisons, we adopt "R@n, IoU@m" as our evaluation metric, following previous works [2, 10, 36, 32, 19]. 
Specifically, "R@n, IoU@m" is defined as the percentage of testing queries having at least one hitting retrieval (with IoU larger than m) among the top-n retrieved segments.

4.2 Implementation Details

Following the previous methods, 3D convolutional features (C3D [28] for TACoS and ActivityNet, and I3D [1] for Charades-STA) are extracted to encode videos, with each feature representing a 1-second video clip. According to the video duration statistics, the length of the input clip sequence is set to 1024 for both ActivityNet Captions and TACoS, and 64 for Charades-STA, to accommodate the temporal convolution. Longer videos are truncated, and shorter ones are padded with zero vectors. For the design of temporal convolutional layers, 6 layers with {32, 16, 8, 4, 2, 1} temporal dimensions, 6 layers with {512, 256, 128, 64, 32, 16} temporal dimensions, and 8 layers with {512, 256, 128, 64, 32, 16, 8, 4} temporal dimensions are set for Charades-STA, TACoS, and ActivityNet Captions, respectively. The first temporal feature map of each network is not used for location prediction, because the receptive fields of its feature units are too small and thus rarely cover target activities. To save model memory footprint, the SCDM mechanism is only performed on the subsequent temporal feature maps, which directly serve for position prediction. For sentence encoding, we first embed each word in sentences with GloVe [23], and then employ a bi-directional GRU to encode the word embedding sequence. As such, words in sentences are finally represented with their corresponding GRU hidden states. The hidden dimension of the sentence bi-directional GRU, the dimension of the multimodal fused features d_f, and the filter number d_h for temporal convolution operations are all set to 512 in this paper. 
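The clip-sequence length handling just described (truncate longer videos, zero-pad shorter ones to the fixed lengths 1024 or 64) can be sketched as follows; this is an illustration only, and the feature dimension 500 is an arbitrary placeholder rather than the actual C3D/I3D dimension.

```python
import numpy as np

def pad_or_truncate(clip_feats, target_len):
    """Fix the clip-sequence length for temporal convolution:
    truncate longer videos, zero-pad shorter ones (Section 4.2)."""
    T, d = clip_feats.shape
    if T >= target_len:
        return clip_feats[:target_len]
    pad = np.zeros((target_len - T, d), dtype=clip_feats.dtype)
    return np.concatenate([clip_feats, pad], axis=0)

v = np.ones((37, 500), dtype=np.float32)   # e.g. 37 one-second clips
padded = pad_or_truncate(v, 64)            # Charades-STA length
truncated = pad_or_truncate(v, 20)
print(padded.shape, truncated.shape)       # (64, 500) (20, 500)
```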
The trade-off parameters of the two loss terms, λ and η, are set to 100 and 10, respectively.

4.3 Compared Methods

We compare our proposed model with the following state-of-the-art baseline methods on the TSG task. CTRL [10]: Cross-modal Temporal Regression Localizer. ACRN [19]: Attentive Cross-Modal Retrieval Network. TGN [2]: Temporal Ground-Net. MCF [32]: Multimodal Circulant Fusion. ACL [12]: Activity Concepts based Localizer. SAP [4]: A two-stage approach based on visual concept mining. Xu et al. [33]: A two-stage method (proposal generation + proposal rerank) exploiting sentence re-construction. MAN [36]: Moment Alignment Network. We use Ours-SCDM to refer to our temporal convolutional network coupled with the proposed SCDM mechanism.

Table 1: Performance comparisons on the TACoS and Charades-STA datasets (%).

                     |               TACoS                |            Charades-STA
Method               | R@1,    R@1,    R@5,    R@5,       | R@1,    R@1,    R@5,    R@5,
                     | IoU@0.3 IoU@0.5 IoU@0.3 IoU@0.5    | IoU@0.5 IoU@0.7 IoU@0.5 IoU@0.7
CTRL (C3D) [10]      | 18.32   13.30   36.69   25.42      | 23.63   8.89    58.92   29.52
MCF (C3D) [32]       | 18.64   12.53   37.13   24.73      | -       -       -       -
ACRN (C3D) [19]      | 19.52   14.62   34.97   24.88      | -       -       -       -
SAP (VGG) [4]        | -       18.24   -       28.11      | 27.42   13.36   66.37   38.15
ACL (C3D) [12]       | 24.17   20.01   42.15   30.66      | 30.48   12.20   64.84   35.13
TGN (C3D) [2]        | 21.77   18.90   39.06   31.02      | -       -       -       -
Xu et al. (C3D) [33] | -       -       -       -          | 35.60   15.80   79.40   45.40
MAN (I3D) [36]       | -       -       -       -          | 46.53   22.72   86.23   53.72
Ours-SCDM (*)        | 26.11   21.17   40.16   32.18      | 54.44   33.43   74.43   58.08

*: We adopt C3D [28] features to encode videos on the TACoS and ActivityNet Captions datasets, and I3D [1] features on the Charades-STA dataset for fair comparisons. Video features adopted by other compared methods are indicated in brackets. VGG denotes VGG16 [25] features.

Table 2: Performance comparisons on the ActivityNet Captions dataset (%).

Method               | R@1,IoU@0.3 R@1,IoU@0.5 R@1,IoU@0.7 | R@5,IoU@0.3 R@5,IoU@0.5 R@5,IoU@0.7
TGN (INP*) [2]       | 45.51       28.47       -           | 57.32       43.33       -
Xu et al. (C3D) [33] | 45.30       27.70       13.60       | 75.70       59.20       38.30
Ours-SCDM (C3D)      | 54.80       36.75       19.86       | 77.29       64.99       41.53

*: INP denotes Inception-V4 [27] features.

4.4 Performance Comparison and Analysis

Table 1 and Table 2 report the performance comparisons between our model and the existing methods on the aforementioned three public datasets. Overall, Ours-SCDM achieves the highest temporal sentence grounding accuracy, demonstrating the superiority of our proposed model. Notably, for localizing complex human activities in the Charades-STA and ActivityNet Captions datasets, Ours-SCDM significantly outperforms the state-of-the-art methods, with 10.71% and 6.26% absolute improvements in the R@1,IoU@0.7 metric, respectively. Although Ours-SCDM achieves lower results on R@5,IoU@0.5 on the Charades-STA dataset, this is mainly due to the biased annotations in this dataset. For example, in Charades-STA, the annotated ground-truth segments are 10s on average while the video duration is only 30s on average. Randomly selecting one candidate segment can also achieve competitive temporal grounding results. This indicates that the Recall values under higher IoUs are more stable and convincing even considering the dataset biases. The performance improvements under the high IoU threshold demonstrate that Ours-SCDM can generate grounded video segments with more precise boundaries. 
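As a reference for how the numbers in Tables 1 and 2 are obtained, here is a minimal sketch of the tIoU and "R@n, IoU@m" metric defined in Section 4.1 (function names are ours; segments are (start, end) pairs in seconds):

```python
def t_iou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(all_rankings, all_gt, n, m):
    """'R@n, IoU@m': percentage of queries whose top-n ranked segments
    contain at least one with tIoU larger than m against the ground truth."""
    hits = sum(
        any(t_iou(seg, gt) > m for seg in ranked[:n])
        for ranked, gt in zip(all_rankings, all_gt)
    )
    return 100.0 * hits / len(all_gt)

# one query: ranked predictions vs. ground truth (10, 20)
ranked = [(0, 5), (9, 21), (11, 19)]
print(recall_at_n_iou([ranked], [(10, 20)], n=1, m=0.5))  # 0.0
print(recall_at_n_iou([ranked], [(10, 20)], n=5, m=0.5))  # 100.0
```

Note that for overlapping segments the hull max(ends) − min(starts) equals the set union, and for disjoint segments the intersection is zero anyway, so the simple union formula above is safe.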
For TACoS, the cooking activities take place in the same kitchen scene with some slightly varied cooking objects (e.g., chopping board, knife, and bread, as shown in the second example of Figure 5). Thus, it is hard to localize such fine-grained activities. However, our proposed model still achieves the best results, except for slightly worse performance on R@5,IoU@0.3.

The main reasons for our proposed model outperforming the competing models are two-fold. First, the sentence information is fully leveraged to modulate the temporal convolution processes, so as to help correlate and compose relevant video contents over time to support the temporal boundary prediction. Second, the modulation procedure dynamically evolves with different video contents in the hierarchical temporal convolution architecture, and therefore characterizes the diverse sentence-video semantic interactions at different granularities.

4.5 Ablation Studies

In this section, we perform ablation studies to examine the contributions of our proposed SCDM. Specifically, we re-train our model with the following four settings.

• Ours-w/o-SCDM: SCDM is replaced by plain batch normalization [15].
• Ours-FC: Instead of performing SCDM, one FC layer is used to fuse each temporal feature unit with the global sentence representation s after each temporal convolutional layer.
• Ours-MUL: Instead of performing SCDM, element-wise multiplication between each temporal feature unit and the global sentence representation s is performed after each temporal convolutional layer.
• Ours-SCM: We use the global sentence representation s to produce γ^c and β^c, without dynamically changing these two modulation vectors with respect to different feature units.

Table 3 shows the performance comparisons of our proposed full model Ours-SCDM w.r.t. 
these\nablations on the Charades-STA dataset (please see results on the other datasets in our supplemental\nmaterial). Without considering SCDM, the performance of the model Ours-w/o-SCDM degenerates\ndramatically. It indicates that only relying on multimodal fusion to exploit the relationship between\nvideo and sentence is not enough for the TSG task. The critical sentence semantics should be\nintensi\ufb01ed to guide the temporal convolution procedure so as to better link the sentence-related video\ncontents over time. However, roughly introducing sentence information in the temporal convolution\narchitecture like Ours-MUL and Ours-FC does not achieve satisfying results. Recall that temporal\nfeature maps in the proposed model are already multimodal representations since the sentence\ninformation has been integrated during the multimodal fusion process. Directly coupling the global\nsentence representation s with temporal feature units could possibly disrupt the visual correlations\nand temporal dependencies of the videos, which poses a negative effect on the temporal sentence\ngrounding performance. In contrast, the proposed SCDM mechanism modulates the temporal feature\nmaps by manipulating their scaling and shifting parameters under the sentence guidance, which is\nlightweight while meticulous, and still achieves the best results.\nIn addition, comparing Ours-SCM with Ours-\nSCDM, we can \ufb01nd that dynamically chang-\ning the modulation vectors \u03b3c and \u03b2c with\nrespect to different temporal feature units is\nbene\ufb01cial, with R@5,IoU@0.7 increasing\nfrom 54.57% of Ours-SCM to 58.08% of\nOurs-SCDM. The SCDM intensi\ufb01es mean-\ningful words and cues in sentences catering\nfor different temporal feature units, with the\nmotivation that different video segments may contain diverse visual contents and express different\nsemantic meanings. 
Establishing the semantic interaction between these two modalities in a dynamic way can better align the semantics between the sentence and diverse video contents, yielding more precise temporal boundary predictions.

Table 3: Ablation studies on the Charades-STA dataset (%).

Method           R@1,IoU@0.5   R@1,IoU@0.7   R@5,IoU@0.5   R@5,IoU@0.7
Ours-w/o-SCDM    47.52         26.91         69.85         49.35
Ours-FC          46.33         25.94         68.96         49.81
Ours-MUL         49.08         28.77         72.68         51.02
Ours-SCM         53.07         31.41         71.71         54.57
Ours-SCDM        54.44         33.43         74.43         58.08

4.6 Model Efficiency Comparison

Table 4: Comparison of model running efficiency, model size, and memory footprint.

Method        Run-Time   Model Size   Memory Footprint
CTRL [10]     2.23s      22M          725MB
ACRN [19]     4.31s      128M         8537MB
Ours-SCDM     0.78s      15M          4533MB

Table 4 shows the run-time efficiency, model size (#param), and memory footprint of different methods. Specifically, "Run-Time" denotes the average time to localize one sentence in a given video. The methods with released codes are run on one Nvidia TITAN XP GPU. The experiments are conducted on the TACoS dataset, since the videos in this dataset are relatively long (7 minutes on average) and thus appropriate for evaluating the temporal grounding efficiency of different methods. It can be observed that Ours-SCDM achieves the fastest run-time with the smallest model size. Both CTRL and ACRN need to first sample candidate segments with various sliding windows in the videos, and then match the input sentence with each of the segments individually. Such a two-stage architecture inevitably limits the temporal sentence grounding efficiency, since the matching procedure over sliding windows is quite time-consuming. In contrast, Ours-SCDM adopts a hierarchical convolution architecture and naturally covers multi-scale video segments for grounding with multi-layer temporal feature maps. Thus, we only need to process the video in one pass of temporal convolution to obtain the TSG results, achieving higher efficiency. In addition, SCDM only needs to control the feature normalization parameters and is lightweight with respect to the overall convolution architecture. Therefore, Ours-SCDM also has a smaller model size.

4.7 Qualitative Results

Some qualitative examples of our model are illustrated in Figure 5. Evidently, our model can produce accurate segment boundaries for the TSG task. Moreover, we also visualize the attention weights (defined in Equation (2)) produced by SCDM when it processes different temporal units. It can be observed that different video contents attentively trigger different words in the sentences so as to better align their semantics. For example, in the first example, the words "walking" and "open" obtain higher attention weights in the ground-truth segment, since the described action indeed happens there, while in the other regions the word attention weights are closer to an even distribution.

Figure 5: Qualitative prediction examples of our proposed model. The rows with green background show the ground-truths for the given sentence queries, and the rows with blue background show the final location prediction results. The gray histograms show the word attention weights produced by SCDM at different temporal regions.

Figure 6: t-SNE projections of temporal feature maps yielded by the models Ours-w/o-SCDM and Ours-SCDM. Each temporal feature unit within these feature maps is represented by its corresponding video clip in the original video. Video clips marked with red color are within ground-truth video segments.

In order to gain more insight into our proposed SCDM mechanism, we visualize the temporal feature maps produced by the variant model Ours-w/o-SCDM and the full model Ours-SCDM.
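This feature-map inspection can be prototyped in a few lines. As a dependency-free stand-in for the t-SNE projection used in the paper, the sketch below projects temporal feature units to 2-D with plain PCA instead; the function name and the toy two-cluster data are ours, for illustration only.

```python
import numpy as np

def project_units_2d(feature_map):
    """Project temporal feature units (T, D) to 2-D for visual inspection.

    The paper applies t-SNE to the units; PCA via SVD is used here only
    to keep the recipe self-contained and deterministic.
    """
    X = feature_map - feature_map.mean(axis=0, keepdims=True)
    # rows of Vt are the principal directions, in decreasing-variance order
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T  # (T, 2) embedding

# toy check: units drawn from two well-separated clusters (standing in for
# clips inside vs. outside a ground-truth segment) should stay separated
rng = np.random.default_rng(0)
inside = rng.normal(loc=3.0, size=(10, 16))    # "ground-truth" clips
outside = rng.normal(loc=-3.0, size=(10, 16))  # remaining clips
emb = project_units_2d(np.vstack([inside, outside]))
```

If the modulation has grouped the sentence-related units together, their 2-D projections form a tight cluster, which is the qualitative pattern Figure 6 reports for Ours-SCDM.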
For both of the trained models, we extract their temporal feature maps, and subsequently apply t-SNE [22] to each temporal feature unit within these maps. Since each temporal feature unit corresponds to one specific location in the original video, we then assign the corresponding video clips to the positions of these feature units in the t-SNE embedded space. As illustrated in Figure 6, temporal feature maps of two testing videos are visualized, where the video clips marked with red color denote the ground-truth segments of the given sentence queries. Interestingly, it can be observed that after SCDM processing, video clips within ground-truth segments are grouped together more tightly, whereas without SCDM processing they are scattered in the learned feature space. This demonstrates that SCDM successfully associates the sentence-related video contents according to the sentence semantics, which benefits the subsequent temporal boundary predictions. More visualization results are provided in the supplemental material.

5 Conclusion

In this paper, we proposed a novel semantic conditioned dynamic modulation mechanism for tackling the TSG task. The proposed SCDM leverages the sentence semantics to modulate the temporal convolution operations to better correlate and compose the sentence-related video contents over time. As SCDM dynamically evolves with the diverse video contents of different temporal granularities in the temporal convolution architecture, the sentence-described video contents are tightly correlated and composed, leading to more accurate temporal boundary predictions.
The experimental results obtained on three widely-used datasets further demonstrate the superiority of the proposed SCDM on the TSG task.

6 Acknowledgement

This work was supported by National Natural Science Foundation of China Major Project No.U1611461 and Shenzhen Nanshan District Ling-Hang Team Grant under No.LHTD20170005.

References

[1] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.

[2] Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 162–171, 2018.

[3] Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, and Jiebo Luo. Localizing natural language in videos. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

[4] Shaoxiang Chen and Yu-Gang Jiang. Semantic proposal for activity localization in videos via sentence query. In AAAI Conference on Artificial Intelligence. IEEE, 2019.

[5] Zhenfang Chen, Lin Ma, Wenhan Luo, and Kwan-Yee K Wong. Weakly-supervised spatio-temporally grounding natural sentence in video.
In The 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.

[6] Harm de Vries, Florian Strub, Jeremie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C. Courville. Modulating early visual processing by language. Neural Information Processing Systems, pages 6594–6604, 2017.

[7] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. International Conference on Learning Representations, 2017.

[8] Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. Spatio-temporal video re-localization by warp lstm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1288–1297, 2019.

[9] Yang Feng, Lin Ma, Wei Liu, Tong Zhang, and Jiebo Luo. Video re-localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 51–66, 2018.

[10] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.

[11] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees G. M. Snoek. Actor and action video segmentation from a sentence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5958–5966, 2018.

[12] Runzhou Ge, Jiyang Gao, Kan Chen, and Ram Nevatia. Mac: Mining activity concepts for language-based temporal localization. In 2019 IEEE Winter Conference on Applications of Computer Vision, pages 245–253. IEEE, 2019.

[13] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[14] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, 2017.

[15] Sergey Ioffe and Christian Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pages 448–456, 2015.

[16] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.

[17] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.

[18] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In Proceedings of the 25th ACM International Conference on Multimedia, pages 988–996, 2017.

[19] Meng Liu, Xiang Wang, Liqiang Nie, Xiangnan He, Baoquan Chen, and Tat-Seng Chua. Attentive moment retrieval in videos. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 15–24. ACM, 2018.

[20] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. Ssd: Single shot multibox detector. European Conference on Computer Vision, pages 21–37, 2016.

[21] Yuan Liu, Lin Ma, Yifeng Zhang, Wei Liu, and Shih-Fu Chang. Multi-granularity generator for temporal action proposal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3604–3613, 2019.

[22] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[23] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.

[24] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36, 2013.

[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[26] Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel, and Ming Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1961–1970, 2016.

[27] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[28] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.

[29] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann Lecun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.

[30] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge, 1(2):2, 2014.

[31] Shuohang Wang and Jing Jiang. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905, 2016.

[32] Aming Wu and Yahong Han.
Multi-modal circulant fusion for video-to-language and backward. In International Joint Conference on Artificial Intelligence, pages 1029–1035, 2018.

[33] Huijuan Xu, Kun He, Leonid Sigal, Stan Sclaroff, and Kate Saenko. Multilevel language and vision integration for text-to-clip retrieval. In AAAI Conference on Artificial Intelligence, volume 2, page 7, 2019.

[34] Jun Yuan, Bingbing Ni, Xiaokang Yang, and Ashraf A Kassim. Temporal action localization with pyramid of score distribution features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3093–3102, 2016.

[35] Yitian Yuan, Lin Ma, and Wenwu Zhu. Sentence specified dynamic video thumbnail generation. In 27th ACM International Conference on Multimedia, 2019.

[36] Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. arXiv preprint arXiv:1812.00087, 2018.