{"title": "Attention in Convolutional LSTM for Gesture Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1953, "page_last": 1962, "abstract": "Convolutional long short-term memory (LSTM) networks have been widely used for action/gesture recognition, and different attention mechanisms have also been embedded into the LSTM or the convolutional LSTM (ConvLSTM) networks. Based on the previous gesture recognition architectures which combine the three-dimensional convolution neural network (3DCNN) and ConvLSTM, this paper explores the effects of attention mechanism in ConvLSTM. Several variants of ConvLSTM are evaluated: (a) Removing the convolutional structures of the three gates in ConvLSTM, (b) Applying the attention mechanism on the input of ConvLSTM, (c) Reconstructing the input and (d) output gates respectively with the modified channel-wise attention mechanism. The evaluation results demonstrate that the spatial convolutions in the three gates scarcely contribute to the spatiotemporal feature fusion, and the attention mechanisms embedded into the input and output gates cannot improve the feature fusion. In other words, ConvLSTM mainly contributes to the temporal fusion along with the recurrent steps to learn the long-term spatiotemporal features, when taking as input the spatial or spatiotemporal features. On this basis, a new variant of LSTM is derived, in which the convolutional structures are only embedded into the input-to-state transition of LSTM. 
The code of the LSTM variants is publicly available.", "full_text": "Attention in Convolutional LSTM for Gesture Recognition\n\nLiang Zhang* (Xidian University, liangzhang@xidian.edu.cn)\nGuangming Zhu* (Xidian University, gmzhu@xidian.edu.cn)\nLin Mei (Xidian University, l_mei72@hotmail.com)\nPeiyi Shen (Xidian University, pyshen@xidian.edu.cn)\nSyed Afaq Ali Shah (Central Queensland University, afaq.shah@uwa.edu.au)\nMohammed Bennamoun (University of Western Australia, mohammed.bennamoun@uwa.edu.au)\n\nAbstract\n\nConvolutional long short-term memory (LSTM) networks have been widely used for action/gesture recognition, and different attention mechanisms have also been embedded into the LSTM or the convolutional LSTM (ConvLSTM) networks. Based on the previous gesture recognition architectures which combine the three-dimensional convolutional neural network (3DCNN) and ConvLSTM, this paper explores the effects of the attention mechanism in ConvLSTM. Several variants of ConvLSTM are evaluated: (a) removing the convolutional structures of the three gates in ConvLSTM, (b) applying the attention mechanism on the input of ConvLSTM, (c) reconstructing the input gate and (d) the output gate with the modified channel-wise attention mechanism. The evaluation results demonstrate that the spatial convolutions in the three gates scarcely contribute to the spatiotemporal feature fusion, and that the attention mechanisms embedded into the input and output gates cannot improve the feature fusion. In other words, when taking spatial or spatiotemporal features as input, ConvLSTM mainly contributes to the temporal fusion along the recurrent steps to learn the long-term spatiotemporal features. On this basis, a new variant of LSTM is derived, in which the convolutional structures are only embedded into the input-to-state transition of LSTM. 
The code of the LSTM variants is publicly available at https://github.com/GuangmingZhu/AttentionConvLSTM.\n\n1 Introduction\n\nLong short-term memory (LSTM) [1] recurrent neural networks are widely used to process sequential data [2]. Several variants of LSTM have been proposed since its inception in 1995 [3]. By extending the fully connected LSTM (FC-LSTM) to have convolutional structures in both the input-to-state and state-to-state transitions, Shi et al. [4] proposed the convolutional LSTM (ConvLSTM) network to process sequential images for precipitation nowcasting. Thereafter, ConvLSTM has been used for action recognition [5, 6], gesture recognition [7–9] and in other fields [10–12].\n\n*Equal Contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nWhen LSTM is used to process video or sequential images, the spatial features of two-dimensional convolutional neural networks (2DCNN) are generally vectorized before being fed as input to LSTM [13, 14]. However, two-dimensional spatial feature maps can be fed into ConvLSTM directly, without the loss of the spatial correlation information. For example, the spatial feature maps of AlexNet/VGG-16 [5, 10] or the spatiotemporal feature maps of three-dimensional CNNs (3DCNN) [7, 8] are used as input to ConvLSTM. Since ConvLSTM was originally proposed to take images as input for precipitation nowcasting, the spatial convolutions are necessary there to learn the spatiotemporal features. However, how much do the convolutional structures of ConvLSTM contribute to the feature fusion when ConvLSTM takes as input spatial convolutional features instead of images? Is it necessary to have different gate values for each element of the feature maps in the spatial domain?\nThe effect of the convolutional structures in ConvLSTM can be analyzed in three cases. (a) ConvLSTM takes original images as input. 
In this case, the convolutional structures are crucial to learn the spatiotemporal features, as verified in [4]. (b) ConvLSTM takes the feature maps of a 2DCNN as input. In this case, the effect of the convolutional structures is not always remarkable. Intuitively, the three gates of ConvLSTM can be viewed as the weighting mechanism for the feature map fusion. However, the different gate values for each element of the feature maps in the spatial domain seemingly do not have the function of spatial attention. Therefore, the soft attention mechanism [15] is additionally introduced into the input of ConvLSTM in [5], in order to make ConvLSTM focus on the noticeable spatial features. The improvement (as illustrated in Table 1 of [5]) brought by the attention mechanism on the input also verifies the above claim to some degree. (c) ConvLSTM takes the feature maps of a 3DCNN as input. Since the 3DCNN networks have already learnt the spatiotemporal features, the gates of ConvLSTM are even less likely to have the function of spatial attention. The last case will be analyzed thoroughly in this paper.\nBased on our previously published \"3DCNN+ConvLSTM+2DCNN\" architecture [8], we construct a preliminary \"Res3D+ConvLSTM+MobileNet\" architecture and derive four variants of the ConvLSTM component. In the preliminary \"Res3D+ConvLSTM+MobileNet\" architecture, the blocks 1-4 of Res3D [16] are first used to learn the local short-term spatiotemporal feature maps, which have a relatively large spatial size. Then, two ConvLSTM layers are stacked to learn the global long-term spatiotemporal feature maps. Finally, parts of MobileNet [17] are used to learn deeper features based on the learnt two-dimensional spatiotemporal feature maps. 
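The data flow just described can be sketched as a shape trace. The concrete numbers below are illustrative assumptions, not values taken from the paper's tables: a 32-frame 112 × 112 clip, 256 output channels for the Res3D block, and shrink ratios of 2 in time and 4 in space for Res3D blocks 1-4 (as described in Section 2.1):

```python
def pipeline_shapes(length, width, height, channels=3, convlstm_filters=256):
    """Trace tensor shapes through the Res3D -> ConvLSTM -> MobileNet
    pipeline.  Assumptions (for illustration only): Res3D blocks 1-4
    output 256 channels, halve the temporal length and quarter the
    spatial size; the two ConvLSTM layers keep the spatial size and
    output `convlstm_filters` channels."""
    shapes = {"input": (length, width, height, channels)}
    # Res3D blocks 1-4: local short-term spatiotemporal features.
    shapes["res3d"] = (length // 2, width // 4, height // 4, 256)
    # Two stacked ConvLSTM layers: spatial size unchanged.
    l, w, h, _ = shapes["res3d"]
    shapes["convlstm"] = (l, w, h, convlstm_filters)
    # MobileNet top layers then process each of the `l` temporal
    # samples independently as a w x h feature map.
    return shapes

print(pipeline_shapes(32, 112, 112)["convlstm"])  # (16, 28, 28, 256)
```

The point of the trace is that the feature maps reaching ConvLSTM still have a relatively large spatial extent, which is why the question of spatial attention inside the gates matters at all.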
The Res3D and MobileNet blocks are fixed, and the ConvLSTM component is modified to derive four variants: (a) Removing the convolutional structures of the gates by performing spatial global average pooling on the input and the hidden states beforehand. This means that the convolutional operations in the three gates are reduced to fully-connected operations. The convolutional structures for the input-to-state transition are reserved to learn the spatiotemporal features. (b) Applying the soft attention mechanism to the input (i.e., the feature maps of the Res3D block) of ConvLSTM. (c) Reconstructing the input gate using the channel-wise attention mechanism. (d) Reconstructing the output gate using the channel-wise attention mechanism.\nWe do not re-evaluate the cases in which ConvLSTM takes images or features of a 2DCNN as input in this paper, since the experiments in [4] and [5] already demonstrate the aforementioned claims. We focus on the evaluation of the third case on the large-scale isolated gesture datasets Jester [18] and IsoGD [19], since the \"3DCNN+ConvLSTM+2DCNN\" architecture was originally proposed for gesture recognition. Experimental results demonstrate that neither the convolutional structures in the three gates of ConvLSTM nor the extra spatial attention mechanisms contribute to performance improvements, given the fact that the input spatiotemporal features of the 3DCNN have already paid attention to the noticeable spatial features. This exploration of attention in ConvLSTM leads to a new variant of LSTM, which is different from both FC-LSTM and ConvLSTM. Specifically, the variant only brings the spatial convolutions to the input-to-state transition, and keeps the gates the same as the gates of FC-LSTM.\n\n2 Attention in ConvLSTM\n\nTo ensure the completeness of the paper, the preliminary \"Res3D+ConvLSTM+MobileNet\" architecture is first described. 
Then, the variants of ConvLSTM are elaborated and analyzed.\n\nFigure 1: An overview of the \"Res3D+ConvLSTM+MobileNet\" architecture. The output of each block is in the format of \"Length*Width*Height*Channel\". MobileNet processes each temporal sample independently.\n\n2.1 The preliminary architecture\n\nTwo-stream or 3DCNN based networks are widely used for action recognition, such as the famous TSN [20], C3D [21], Res3D [16], and I3D [22] networks. Gesture recognition is different from action recognition: the category of a dynamic gesture cannot be told from a single glance at one image, whereas the category of an action often can be, given the hints of the backgrounds, objects and postures. Therefore, the aforementioned famous networks cannot produce state-of-the-art performance on gesture recognition without including multimodal fusion. Gestures depend on the local information of hands and the global motions of arms. Thus, we use a shallow 3DCNN to learn the local short-term spatiotemporal features first. The 3DCNN block does not need to be deep, since it focuses on the local features; therefore, the modified blocks 1-4 of Res3D are used. The temporal duration (or spatial size) of the output feature maps is only shrunk by a ratio of 2 (or 4), compared with the input images. Then, a two-layer ConvLSTM network is stacked to learn the long-term spatiotemporal feature maps. The ConvLSTM network does not shrink the spatial size of the feature maps, so the spatiotemporal feature maps still have a relatively large spatial size. The top layers of MobileNet, whose inputs have the same spatial size, are further stacked to learn deeper features. The comparison with the aforementioned famous networks will be given in the experimental part to demonstrate the advantages of the architecture (as displayed in Fig. 
1).\n\n2.2 The variants of ConvLSTM\n\nFormally, ConvLSTM can be formulated as:\n\ni_t = σ(W_xi ∗ X_t + W_hi ∗ H_{t−1} + b_i)    (1)\nf_t = σ(W_xf ∗ X_t + W_hf ∗ H_{t−1} + b_f)    (2)\no_t = σ(W_xo ∗ X_t + W_ho ∗ H_{t−1} + b_o)    (3)\nG_t = tanh(W_xc ∗ X_t + W_hc ∗ H_{t−1} + b_c)    (4)\nC_t = f_t ◦ C_{t−1} + i_t ◦ G_t    (5)\nH_t = o_t ◦ tanh(C_t)    (6)\n\nwhere σ is the sigmoid function, and the W_x∼ and W_h∼ are 2-d convolution kernels. The input X_t, the cell state C_t, the hidden state H_t, the candidate memory G_t, and the gates i_t, f_t, o_t are all 3D tensors. The symbol \"∗\" denotes the convolution operator, and \"◦\" denotes the Hadamard product.\nThe input X_t has a spatial size of W × H with C_in channels, and ConvLSTM has a convolutional kernel size of K × K with C_out channels. Thus, the parameter size of ConvLSTM can be calculated as (the biases are ignored for simplicity):\n\nParam_ConvLSTM = K × K × (C_in + C_out) × C_out × 4    (7)\n\nThe parameter size of ConvLSTM is very large, partly due to the convolutional structures. It can be concluded from Eqs. (1)-(6) that the gates i_t, f_t, o_t have a spatial size of W × H with C_out channels (assuming same-padding convolutions). This means that the three gates have independent values for each element of the feature maps in the cell state and the candidate memory. In this case, can ConvLSTM focus on the noticeable spatial regions with the help of different gate values in the spatial domain? In order to provide an answer and remove any doubt, four variants of ConvLSTM are constructed as follows (as illustrated in Fig. 2).\n\nFigure 2: An overview of the four variants of ConvLSTM. The \"P&FC\" denotes the spatial global average pooling and fully-connected operations, as expressed in Eqs. (8)-(12). The \"Conv\" denotes the convolutional structure in Eqs. 
(1)-(4), (13), (17) and (21). The \"Atten\" denotes the standard attention mechanism in Eqs. (17)-(19). The \"CAtten\" denotes the channel-wise attention in Eqs. (21)-(23).\n\n(a) Removing the convolutional structures of the gates\n\nGiven the local spatiotemporal features of the 3DCNN block, it can be considered that the 3DCNN block has already paid attention to the noticeable spatial regions where there is valuable spatiotemporal information. Therefore, the ConvLSTM block can focus solely on the spatiotemporal feature fusion along the recurrent steps. The gate values only need to be calculated for each feature map of the states, not for each element. Therefore, a global average pooling is performed on the input features and the hidden states to reduce the spatial dimension, so that fully-connected operations can be performed instead of convolutions in the gates. The variant of ConvLSTM can be formulated as:\n\nX̄_t = GlobalAveragePooling(X_t)    (8)\nH̄_{t−1} = GlobalAveragePooling(H_{t−1})    (9)\ni_t = σ(W_xi X̄_t + W_hi H̄_{t−1} + b_i)    (10)\nf_t = σ(W_xf X̄_t + W_hf H̄_{t−1} + b_f)    (11)\no_t = σ(W_xo X̄_t + W_ho H̄_{t−1} + b_o)    (12)\nG_t = tanh(W_xc ∗ X_t + W_hc ∗ H_{t−1} + b_c)    (13)\nC_t = f_t ◦ C_{t−1} + i_t ◦ G_t    (14)\nH_t = o_t ◦ tanh(C_t)    (15)\n\nThe gates i_t, f_t and o_t are all one-dimensional vectors, so the elements in each feature map are weighted by the same gate value in Eqs. (14)-(15). The convolutional structures in the three gates are reduced to fully-connected operations. The convolutional structure for the input-to-state transition (as in Eq. (13)) is reserved for the spatiotemporal feature fusion.\nIn order to reduce the number of parameters of the input-to-state transition, the depthwise separable convolutions [23] are used. 
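One recurrent step of Eqs. (8)-(15) can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: biases are omitted, the input-to-state convolution is written as a plain (not depthwise-separable) same-padding convolution, and all weight names and sizes are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """Naive same-padding 2-d convolution (deep-learning convention).
    x: (W, H, Cin), w: (K, K, Cin, Cout) -> (W, H, Cout)."""
    K, p = w.shape[0], w.shape[0] // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((x.shape[0], x.shape[1], w.shape[3]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # contract the (K, K, Cin) patch against the kernel
            out[i, j] = np.tensordot(xp[i:i + K, j:j + K, :], w, axes=3)
    return out

def variant_a_step(x, h_prev, c_prev, params):
    """One step of variant (a): fully-connected gates on globally
    pooled features (Eqs. 8-12), convolutional candidate memory
    (Eq. 13), standard LSTM fusion (Eqs. 14-15). Biases omitted."""
    x_bar = x.mean(axis=(0, 1))                                  # Eq. (8)
    h_bar = h_prev.mean(axis=(0, 1))                             # Eq. (9)
    i = sigmoid(params["Wxi"] @ x_bar + params["Whi"] @ h_bar)   # Eq. (10)
    f = sigmoid(params["Wxf"] @ x_bar + params["Whf"] @ h_bar)   # Eq. (11)
    o = sigmoid(params["Wxo"] @ x_bar + params["Who"] @ h_bar)   # Eq. (12)
    g = np.tanh(conv2d_same(x, params["Wxc"])
                + conv2d_same(h_prev, params["Whc"]))            # Eq. (13)
    c = f * c_prev + i * g     # Eq. (14): gates broadcast per channel
    h = o * np.tanh(c)         # Eq. (15)
    return h, c

# Tiny illustrative sizes: 4 x 4 maps, Cin = 3, Cout = 8, K = 3.
rng = np.random.default_rng(0)
Cin, Cout, K = 3, 8, 3
params = {
    "Wxi": rng.normal(size=(Cout, Cin)), "Whi": rng.normal(size=(Cout, Cout)),
    "Wxf": rng.normal(size=(Cout, Cin)), "Whf": rng.normal(size=(Cout, Cout)),
    "Wxo": rng.normal(size=(Cout, Cin)), "Who": rng.normal(size=(Cout, Cout)),
    "Wxc": rng.normal(size=(K, K, Cin, Cout)),
    "Whc": rng.normal(size=(K, K, Cout, Cout)),
}
x = rng.normal(size=(4, 4, Cin))
h, c = variant_a_step(x, np.zeros((4, 4, Cout)), np.zeros((4, 4, Cout)), params)
```

Note how each gate is a `(Cout,)` vector that scales whole channels of the candidate memory and cell state, which is exactly the reduction from per-element to per-feature-map gating described above.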
This reduces the parameter size of the variant of ConvLSTM to\n\nParam_ConvLSTM-va = (K × K + C_out × 4) × (C_in + C_out)    (16)\n\nThree more variants are constructed based on variant (a), in order to verify whether the spatial attention can improve the performances.\n\n(b) Applying the attention mechanism to the inputs\n\nBy referring to [5], we apply the spatial attention mechanism to the inputs before the operations of Eqs. (8)-(15). Formally, the attention mechanism can be formulated as:\n\nZ_t = W_z ∗ tanh(W_xa ∗ X_t + W_ha ∗ H_{t−1} + b_a)    (17)\nA_t^{ij} = p(att_{ij} | X_t, H_{t−1}) = exp(Z_t^{ij}) / (Σ_i Σ_j exp(Z_t^{ij}))    (18)\nX̃_t = A_t ◦ X_t    (19)\n\nwhere A_t is a 2-d score map, and W_z is a 2-d convolution kernel with a kernel size of K × K × C_in × 1. Variant (b) can be constructed by replacing X_t in Eqs. (8)-(15) with X̃_t. The parameter size of this variant can be calculated as\n\nParam_ConvLSTM-vb = Param_ConvLSTM-va + K × K × (C_in + C_out × 2) + (C_in + C_out) × C_out    (20)\n\n(c) Reconstructing the input gate using the channel-wise attention\n\nBoth the gate and the attention mechanisms need to perform convolutions on the input and the hidden states, as expressed in Eqs. (1)-(3) and Eq. (17). Does this mean that the gate mechanism implicitly has the function of attention? The answer is no. The independent gate values in the spatial domain of the feature maps cannot ensure the attention effect as expressed in Eq. (18). Therefore, we reconstruct the input gate according to the attention mechanism. The sigmoid activation function makes the gate values fall in the range 0-1, whereas the division by the sum in Eq. (18) results in attention scores whose sum is 1 in each feature channel. 
This means that the attention scores in each feature channel may be far less than 1, and far less than most of the normal gate values in the other gates, given the large spatial size of the input feature maps. Therefore, the attention mechanism needs to be modified to match the range of the sigmoid function in the gates. Formally, the input gate can be reformulated as:\n\nZ_t = W_i ∗ tanh(W_xi ∗ X_t + W_hi ∗ H_{t−1} + b_i)    (21)\nA_t^{ij}(c) = exp(Z_t^{ij}(c)) / max_{i,j} exp(Z_t^{ij}(c))    (22)\ni_t = {A_t^{ij}(c) : (i, j, c) ∈ R^{W×H×C_out}}    (23)\n\nwhere W_i is a 2-d convolution kernel with a kernel size of W × H and a channel number of C_out. The term \"max_{i,j} exp(Z_t^{ij}(c))\" in Eq. (22) corresponds to the maximum element chosen within the channel c of Z_t. In other words, the normalization in Eq. (22) is performed channel-wise. The division by the maximum value instead of the sum ensures that the attention scores are distributed in the range of 0-1. Variant (c) of ConvLSTM can be constructed by replacing the input gate of variant (a) with the new gate expressed by Eqs. (21)-(23). The parameter size of this variant can be calculated as\n\nParam_ConvLSTM-vc = Param_ConvLSTM-va + K × K × (C_in + C_out × 2) + C_out × C_out    (24)\n\n(d) Reconstructing the output gate using the channel-wise attention\n\nVariant (b) of ConvLSTM applies the attention mechanism on the input feature maps, while variant (c) applies the attention mechanism on the candidate memory. Finally, variant (d) of ConvLSTM is constructed by applying the attention mechanism on the cell state. In other words, the output gate is reconstructed in the same way as the input gate in variant (c). 
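The contrast between the sum normalization of Eq. (18) and the max normalization of Eq. (22) is easy to check numerically. The sketch below (illustrative map size and random scores, not the paper's data) shows that softmax scores over a 28 × 28 channel are all small, while max-normalized scores span the full 0-1 range expected by a sigmoid-style gate:

```python
import numpy as np

def softmax_scores(z):
    """Eq. (18)-style normalization: the scores of one channel sum to 1."""
    e = np.exp(z - z.max())   # subtracting the max is only for numerical stability
    return e / e.sum()

def max_norm_scores(z):
    """Eq. (22)-style normalization: divide by the channel-wise maximum,
    so the largest score is exactly 1 and all scores lie in (0, 1]."""
    e = np.exp(z - z.max())
    return e / e.max()

rng = np.random.default_rng(0)
z = rng.normal(size=(28, 28))   # one channel of Z_t, W x H = 28 x 28 (assumed)

soft = softmax_scores(z)        # sums to 1; every individual score is tiny
maxn = max_norm_scores(z)       # peak score is exactly 1
```

With 784 spatial positions sharing a total mass of 1, even the largest softmax score is a small fraction, which is the mismatch with sigmoid-range gate values that motivates Eq. (22).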
The expressions are similar to those in Eqs. (21)-(23), and they are thus omitted for simplicity.\n\n3 Experiments\n\nThe case in which ConvLSTM takes features from a 2DCNN as input has been evaluated in [5], and the improvement (as illustrated in Table 1 of [5]) brought by the attention mechanism on the input features indicates, to some degree, that the convolutional structures in the gates cannot play the role of spatial attention. Due to page restrictions, this paper only focuses on the evaluation of the case in which ConvLSTM takes features from a 3DCNN as input. As aforementioned, the \"3DCNN+ConvLSTM+2DCNN\" architecture was originally proposed for gesture recognition [8]. Therefore, the proposed variants of ConvLSTM are evaluated on the large-scale isolated gesture datasets Jester [18] and IsoGD [19] in this paper.\n\n3.1 Datasets\n\nJester [18] is a large collection of densely-labeled video clips. Each clip contains a pre-defined hand gesture performed by a worker in front of a laptop camera or webcam. The dataset includes 148,094 RGB video files of 27 kinds of gestures. It is the largest isolated gesture dataset, in which each category has more than 5,000 instances on average. Therefore, this dataset was used to train our networks from scratch.\nIsoGD [19] is a large-scale isolated gesture dataset which contains 47,933 RGB+D gesture videos of 249 kinds of gestures performed by 21 subjects. The dataset was used in the 2016 [24] and 2017 [25] ChaLearn LAP Large-scale Isolated Gesture Recognition Challenges, which gives the benefit that our results can be compared directly with the state-of-the-art networks used in the challenges. Different multi-modal fusion methods were used by the teams in the challenges; in this paper, only the evaluation on each single modality is performed (without multi-modal fusion), to verify the advantages of the different deep architectures.\n\n3.2 Implementation details\n\nThe base architecture has been displayed in Fig. 1. 
The Res3D and MobileNet components are deployed from their original versions, except for the aforementioned modifications in Section 2.1. These two components are fixed among the variants. The filter numbers of ConvLSTM and the variants are all set to 256.\nThe networks using the original ConvLSTM or the variants are first trained on the Jester dataset from scratch, and then fine-tuned on the IsoGD dataset to report the final results. For the training on Jester, the learning rate follows a polynomial decay from 0.001 to 0.000001 within a total of 30 epochs. Each input batch consists of 16 video clips, and each clip contains 16 frames with a spatial size of 112 × 112. Uniform sampling with the temporal jitter strategy [26] is utilized to preprocess the inputs. During the fine-tuning on IsoGD, the batch size is set to 8, the temporal length is set to 32, and a total of 15 epochs are performed for each variant. The top-1 accuracy is used as the evaluation metric. Stochastic gradient descent (SGD) is used for training.\n\n3.3 Explorative study\n\nThe networks which use the original ConvLSTM or the four variants as the ConvLSTM component in Fig. 1 are evaluated on the Jester and IsoGD datasets respectively. The evaluation results are illustrated in Table 1. The evaluated networks achieve almost the same accuracy on Jester, except for variant (b). The similar recognition results on Jester may be explained by the network capacity or the distinguishability of the data, because the validation accuracy is comparable with the training accuracy. The lower accuracy of variant (b) may indicate that the extra attention mechanism on the inputs is useless, since the learnt spatiotemporal features of the 3DCNN have already paid attention to the noticeable spatial regions.\n\nTable 1: Comparison among the original ConvLSTM and the four variants. For simplicity, each row in the \"Networks\" column denotes the deep architecture (as displayed in Fig. 
1) which takes the original ConvLSTM or its variant as the ConvLSTM component.\n\nNetworks | Jester Acc(%) | IsoGD Acc(%) | Channel Num | Parameter Size | Mult-Adds\nConvLSTM | 95.11 | 52.01 | 256 | 4.719M | 3700M\nVariant (a) | 95.12 | 55.98 | 256 | 0.529M | 415M\nVariant (b) | 94.18 | 43.93 | 256 | 0.667M | 522M\nVariant (c) | 95.13 | 53.27 | 256 | 0.601M | 472M\nVariant (d) | 95.10 | 54.11 | 256 | 0.601M | 472M\n\nThe lower accuracy of variant (b) on IsoGD further supports this conclusion. The lower accuracy may be due to the additional optimization difficulty caused by the extra multiplication operations in the attention mechanism.\nThe comparison on IsoGD shows that variant (a) is superior to the original ConvLSTM, in terms of recognition accuracy as well as parameter size and computational consumption. The reduction of the convolutional structures in the three gates does not reduce the network capacity, but saves memory and computation significantly. The specific attention mechanisms embedded in the input and output gates cannot contribute to the feature fusion; they just bring extra memory and computational consumption. These observations demonstrate that the ConvLSTM component only needs to make full use of its advantage in long-term temporal fusion, when the input features have already learnt the local spatiotemporal information. LSTM/RNN has its superiority in processing long sequential data. The extension from LSTM to ConvLSTM therefore only needs to increase the dimensionality of the states and memory, and keep the original gate mechanism unchanged.\nThis evaluation leads to a new variant of LSTM (i.e., variant (a) of ConvLSTM), in which the convolutional structures are only introduced into the input-to-state transition, and the gates still have the original fully-connected mechanism. The added convolutional structures make the variant of LSTM capable of performing the spatiotemporal feature fusion. 
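The Parameter Size column of Table 1 can be reproduced directly from Eqs. (7), (16), (20) and (24), assuming K = 3 and C_in = C_out = 256 (the stated filter number; treating the input channel count as 256 is an inference from the reported totals, not an explicitly stated value):

```python
def param_convlstm(K, Cin, Cout):
    # Eq. (7): four K x K convolutions over [X_t, H_{t-1}], biases ignored.
    return K * K * (Cin + Cout) * Cout * 4

def param_variant_a(K, Cin, Cout):
    # Eq. (16): depthwise-separable input-to-state conv + fully-connected gates.
    return (K * K + Cout * 4) * (Cin + Cout)

def param_variant_b(K, Cin, Cout):
    # Eq. (20): variant (a) plus the spatial attention of Eqs. (17)-(19).
    return (param_variant_a(K, Cin, Cout)
            + K * K * (Cin + Cout * 2) + (Cin + Cout) * Cout)

def param_variant_c(K, Cin, Cout):
    # Eq. (24): variant (a) plus the channel-wise attention of Eqs. (21)-(23);
    # variant (d) has the same count, matching Table 1.
    return (param_variant_a(K, Cin, Cout)
            + K * K * (Cin + Cout * 2) + Cout * Cout)

K, Cin, Cout = 3, 256, 256   # assumed sizes that reproduce Table 1
print(param_convlstm(K, Cin, Cout))   # 4718592 (~4.719M)
print(param_variant_a(K, Cin, Cout))  # 528896  (~0.529M)
print(param_variant_b(K, Cin, Cout))  # 666880  (~0.667M)
print(param_variant_c(K, Cin, Cout))  # 601344  (~0.601M)
```

The roughly 9x reduction from ConvLSTM to variant (a) comes almost entirely from replacing the four gate convolutions with fully-connected operations on pooled vectors.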
The gate mechanism still sticks to its own responsibility for, and superiority in, the long-term temporal fusion.\n\n3.4 Comparison with the state-of-the-art\n\nTable 2 shows the comparison results with the state-of-the-art networks on IsoGD. The 2DCNN networks demonstrate their superiority in image-based applications, and also show their ability for action recognition with the help of the specific backgrounds and objects. However, they do not keep their unbeatable performance in the case of gesture recognition, where the fine-grained spatiotemporal features of hands and the global motions of arms do matter. The 3DCNN networks are good at spatiotemporal feature learning, but their weakness in long-term temporal fusion restricts their capabilities. The \"3DCNN+ConvLSTM+2DCNN\" architecture makes full use of the advantages of 3DCNN, ConvLSTM and 2DCNN. The proposed variant (a) of ConvLSTM further enhances ConvLSTM's ability for spatiotemporal feature fusion, without any additional burden. Therefore, the best recognition results can be obtained by making full use of the intrinsic advantages of the different networks. Although [27] reports the state-of-the-art performance on IsoGD, that high accuracy is achieved by fusing 12 channels (i.e., global/left/right channels for four modalities). The proposed network obtains the best accuracy on each single modality, which demonstrates the superiority of the proposed architecture.\n\n3.5 Visualization of the feature map fusion\n\nThe reduction of the convolutional structures of the three gates in ConvLSTM brings no side effects to the spatiotemporal feature map fusion. Fig. 3 displays an example of the visualization of the feature map fusion along the recurrent steps. It can be seen from the heat maps that the most active regions just reflect the hands' motion trajectories. These are similar to the attention score maps. 
This also indicates that the learnt spatiotemporal features from the 3DCNN have paid attention to the noticeable spatial regions, and that no extra attention mechanism is needed when fusing the long-term spatiotemporal feature maps using ConvLSTM. The reduction of the convolutional structures of the three gates in ConvLSTM also makes the variant more applicable for constructing more complex deep architectures, since this variant has fewer parameters and lower computational consumption.\n\nTable 2: Comparison with the state-of-the-art networks on the valid set of IsoGD.\n\nDeep Architecture | RGB Acc(%) | Depth Acc(%) | Flow Acc(%)\nResNet50 [27] | 33.22 | 27.98 | 46.22\nPyramidal C3D [26] | 38.00 | 36.58 | -\nC3D [28] | 40.50 | 37.30 | -\nRes3D [29] | 48.44 | 45.07 | 44.45\n3DCNN+BiConvLSTM+2DCNN [8] | 51.31 | 49.81 | 45.30\nRes3D+ConvLSTM+MobileNet | 51.30 | 52.01 | 45.59\nRes3D+ConvLSTM Variant(a)+MobileNet | 55.98 | 53.28 | 46.51\n\nFigure 3: An example of visualization of the feature map fusion in the case of variant (a) of ConvLSTM along the recurrent steps. The feature map which has the largest activation sum among the 256 channels is visualized.\n\n4 Conclusion\n\nThe effects of attention in convolutional LSTM are explored in this paper. Our evaluation results and previously published results show that the convolutional structures in the gates of ConvLSTM do not play the role of spatial attention, even if the gates have independent weight values for each element of the feature maps in the spatial domain. The reduction of the convolutional structures in the three gates results in a better accuracy, a smaller parameter size and a lower computational consumption. This leads to a new variant of LSTM, in which the convolutional structures are only added to the input-to-state transition, and the gates still stick to their own responsibility for, and superiority in, long-term temporal fusion. 
This makes the proposed variant capable of effectively performing spatiotemporal feature fusion, with fewer parameters and less computational consumption.\n\nAcknowledgments\n\nThis work is partially supported by the National Natural Science Foundation of China under Grant No. 61702390, and the Fundamental Research Funds for the Central Universities under Grant JB181001.\n\nReferences\n\n[1] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.\n[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.\n[3] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.\n[4] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems (NIPS), pages 802–810, 2015.\n[5] Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, and Cees G. M. Snoek. VideoLSTM convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41–50, 2018.\n[6] Lei Wang, Yangyang Xu, Jun Cheng, Haiying Xia, Jianqin Yin, and Jiaji Wu. Human action recognition by learning spatio-temporal features with deep neural networks. IEEE Access, 6:17913–17922, 2018.\n[7] Guangming Zhu, Liang Zhang, Peiyi Shen, and Juan Song. Multimodal gesture recognition using 3-D convolution and convolutional LSTM. IEEE Access, 5:4517–4524, 2017.\n[8] Liang Zhang, Guangming Zhu, Peiyi Shen, Juan Song, Syed Afaq Shah, and Mohammed Bennamoun. Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition. 
In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 3120–3128, 2017.\n[9] Huogen Wang, Pichao Wang, Zhanjie Song, and Wanqing Li. Large-scale multimodal gesture segmentation and recognition based on convolutional neural networks. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 3138–3146, 2017.\n[10] Swathikiran Sudhakaran and Oswald Lanz. Convolutional long short-term memory networks for recognizing first person interactions. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 2339–2346, 2017.\n[11] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional LSTM for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 439–444, 2017.\n[12] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S. Yu. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In Advances in Neural Information Processing Systems (NIPS), pages 879–888, 2017.\n[13] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015.\n[14] Lionel Pigou, Aäron Van Den Oord, Sander Dieleman, Mieke Van Herreweghe, and Joni Dambre. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video. International Journal of Computer Vision, 126(2-4):430–439, 2018.\n[15] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. 
In\nInternational Conference on Machine Learning (ICML), pages 2048\u20132057, 2015.\n\n[16] Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. Convnet architecture search for\n\nspatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.\n\n[17] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco\nAndreetto, and Hartwig Adam. Mobilenets: Ef\ufb01cient convolutional neural networks for mobile vision\napplications. arXiv preprint arXiv:1704.04861, 2017.\n\n[18] www.twentybn.com.\n\nhttps://www.twentybn.com/datasets/jester, 2017.\n\nTwentybn\n\njester\n\ndataset:\n\na\n\nhand\n\ngesture\n\ndataset.\n\n[19] Jun Wan, Yibing Zhao, Shuai Zhou, Isabelle Guyon, Sergio Escalera, and Stan Z Li. Chalearn looking\nat people rgb-d isolated and continuous datasets for gesture recognition. In Proceedings of the IEEE\nConference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 56\u201364, 2016.\n\n9\n\n\f[20] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal\nsegment networks: Towards good practices for deep action recognition. In European Conference on\nComputer Vision (ECCV), pages 20\u201336. Springer, 2016.\n\n[21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal\nfeatures with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision\n(ICCV), pages 4489\u20134497. IEEE, 2015.\n\n[22] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset.\nIn 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724\u20134733. IEEE,\n2017.\n\n[23] Lukasz Kaiser, Aidan N Gomez, and Francois Chollet. Depthwise separable convolutions for neural\n\nmachine translation. 
arXiv preprint arXiv:1706.03059, 2017.\n\n[24] Hugo Jair Escalante, V\u00edctor Ponce-L\u00f3pez, Jun Wan, Michael A Riegler, Baiyu Chen, Albert Clap\u00e9s,\nSergio Escalera, Isabelle Guyon, Xavier Bar\u00f3, P\u00e5l Halvorsen, et al. Chalearn joint contest on multimedia\nchallenges beyond visual analysis: An overview. In 2016 23rd International Conference on Pattern\nRecognition (ICPR), pages 67\u201373, 2016.\n\n[25] Jun Wan, Sergio Escalera, X Baro, Hugo Jair Escalante, I Guyon, M Madadi, J Allik, J Gorbova, and\nG Anbarjafari. Results and analysis of chalearn lap multi-modal isolated and continuous gesture recognition,\nand real versus fake expressed emotions challenges. In Proceedings of IEEE International Conference on\nComputer Vision (ICCV), pages 3189\u20133197, 2017.\n\n[26] Guangming Zhu, Liang Zhang, Lin Mei, Jie Shao, Juan Song, and Peiyi Shen. Large-scale isolated gesture\nrecognition using pyramidal 3d convolutional networks. In 2016 23rd International Conference on Pattern\nRecognition (ICPR), pages 19\u201324, 2016.\n\n[27] Pradyumna Narayana, J. Ross Beveridge, and Bruce A Draper. Gesture recognition: Focus on the hands.\n\nIn 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.\n\n[28] Yunan Li, Qiguang Miao, Kuan Tian, Yingying Fan, Xin Xu, Rui Li, and Jianfeng Song. Large-scale\ngesture recognition with a fusion of rgb-d data based on the c3d model. In 2016 23rd International\nConference on Pattern Recognition (ICPR), pages 25\u201330, 2016.\n\n[29] Qiguang Miao, Yunan Li, Wanli Ouyang, Zhenxin Ma, Xin Xu, Weikang Shi, Xiaochun Cao, Zhipeng\nLiu, Xiujuan Chai, Zhuang Liu, et al. Multimodal gesture recognition based on the resc3d network. 
In\nProceedings of IEEE International Conference on Computer Vision (ICCV), pages 3047\u20133055, 2017.\n\n10\n\n\f", "award": [], "sourceid": 981, "authors": [{"given_name": "Liang", "family_name": "Zhang", "institution": "School of Computer Science and Technology, Xidian University, China"}, {"given_name": "Guangming", "family_name": "Zhu", "institution": "Xidian University"}, {"given_name": "Lin", "family_name": "Mei", "institution": "The Third Research Institute of Ministry of Public Security, China"}, {"given_name": "Peiyi", "family_name": "Shen", "institution": "School of Software, Xidian University, China"}, {"given_name": "Syed Afaq Ali", "family_name": "Shah", "institution": "Department of Computer Science and Software Engineering, The University of Western Australia"}, {"given_name": "Mohammed", "family_name": "Bennamoun", "institution": "University of Western Australia"}]}
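The conclusion describes an LSTM variant in which convolutional structure survives only in the input-to-state transition, the gates having lost theirs. The sketch below illustrates one possible reading of that design in plain NumPy: the three gates are computed as scalars from spatially pooled features, while only the cell-candidate path convolves the input. The class name `SimpleConvLSTMVariant`, the single-channel simplification, and the scalar state-to-state weight are all illustrative assumptions; this is not the authors' released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, k):
    """Naive 'same'-padded 2D correlation of an (H, W) map with a (kh, kw) kernel."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

class SimpleConvLSTMVariant:
    """Toy single-channel cell: the input/forget/output gates are scalars
    computed from spatially pooled features (their convolutional structure
    removed), and only the input-to-state candidate path keeps a spatial
    convolution. An illustrative simplification, not the paper's exact model."""

    def __init__(self, height, width, ksize=3, seed=0):
        rng = np.random.default_rng(seed)
        self.k_in = rng.normal(0.0, 0.1, (ksize, ksize))  # input-to-state conv kernel
        self.w_x = rng.normal(0.0, 0.1, 3)  # gate weights on the pooled input
        self.w_h = rng.normal(0.0, 0.1, 3)  # gate weights on the pooled hidden state
        self.u = 0.1                        # scalar state-to-state weight (no conv)
        self.h = np.zeros((height, width))
        self.c = np.zeros((height, width))

    def step(self, x):
        # Gates lose their spatial structure: scalars from pooled features.
        i, f, o = sigmoid(self.w_x * x.mean() + self.w_h * self.h.mean())
        # Only the input-to-state transition remains convolutional.
        g = np.tanh(conv2d_same(x, self.k_in) + self.u * self.h)
        self.c = f * self.c + i * g
        self.h = o * np.tanh(self.c)
        return self.h
```

Because the gate parameters shrink from full convolution kernels to a handful of scalars, a cell built this way has far fewer parameters than a standard ConvLSTM cell of the same spatial size, which is the trade-off the concluding sentence points to.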