{"title": "Chirality Nets for Human Pose Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 8163, "page_last": 8173, "abstract": "We propose Chirality Nets, a family of deep nets that is equivariant to the \u201cchirality transform,\u201d i.e., the transformation to create a chiral pair. Through parameter sharing, odd and even symmetry, we propose and prove variants of standard building blocks of deep nets that satisfy the equivariance property, including fully connected layers, convolutional layers, batch-normalization, and LSTM/GRU cells. The proposed layers lead to a more data efficient representation and a reduction in computation by exploiting symmetry. We evaluate chirality nets on the task of human pose regression, which naturally exploits the left/right mirroring of the human body. We study three pose regression tasks: 3D pose estimation from video, 2D pose forecasting, and skeleton based activity recognition. Our approach achieves/matches state-of-the-art results, with more significant gains on small datasets and limited-data settings.", "full_text": "Chirality Nets for Human Pose Regression\n\nRaymond A. Yeh\u2217, Yuan-Ting Hu*, Alexander G. Schwing\n\nDepartment of Electrical Engineering, University of Illinois at Urbana-Champaign\n\n{yeh17, ythu2, aschwing}@illinois.edu\n\nAbstract\n\nWe propose Chirality Nets, a family of deep nets that is equivariant to the \u201cchirality\ntransform,\u201d i.e., the transformation to create a chiral pair. Through parameter\nsharing, odd and even symmetry, we propose and prove variants of standard\nbuilding blocks of deep nets that satisfy the equivariance property, including fully\nconnected layers, convolutional layers, batch-normalization, and LSTM/GRU cells.\nThe proposed layers lead to a more data ef\ufb01cient representation and a reduction\nin computation by exploiting symmetry. 
We evaluate chirality nets on the task of human pose regression, which naturally exploits the left/right mirroring of the human body. We study three pose regression tasks: 3D pose estimation from video, 2D pose forecasting, and skeleton based activity recognition. Our approach achieves/matches state-of-the-art results, with more significant gains on small datasets and limited-data settings.

1 Introduction

Human pose regression tasks such as human pose estimation, human pose forecasting and skeleton based action recognition have numerous applications in video understanding, security and human-computer interaction. For instance, collaborative virtual reality applications rely on accurate pose estimation, for which significant advances have been reported in recent years.
Specifically, recent state-of-the-art approaches use supervised learning to address pose regression and employ deep nets. Input and output of those nets depend on the task: inputs are typically 2D or 3D human pose key-points stacked into a vector; the output may represent human pose key-points for pose estimation or a classification probability for activity recognition. To improve accuracy of those tasks, a variety of deep net architectures have been proposed [34, 3, 17, 29, 42, 48], generally relying on common deep net building blocks, such as fully connected, convolutional or recurrent layers.
Unlike for image datasets, to enlarge the size of human pose datasets, a reflection (left-right flipping) of the pose coordinates as illustrated in step (1) of Fig. 1 is not sufficient. The chirality of the human pose additionally requires switching the labeling of left and right as illustrated in step (2) of Fig. 1. However, while this two-step data augmentation is conceptually easy to employ during training, we argue that even better accuracy is possible for human pose regression tasks if this pose symmetry is directly built into the deep net.
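To make the two-step transform concrete, the following is a minimal sketch on a toy 2D pose; the dict-based representation and joint names are illustrative only, not the representation used later in the paper:

```python
# Toy sketch of the chirality transform of Fig. 1 (illustrative only).
# A 2D pose is a dict: joint name -> (x, y).  Step (1) negates the
# x-coordinate (reflection); step (2) swaps the left/right joint labels.

def reflect(pose):
    """Step (1): negate the x-coordinate of every joint."""
    return {name: (-x, y) for name, (x, y) in pose.items()}

def swap_left_right(pose):
    """Step (2): exchange the labels of corresponding left/right joints."""
    def flip(name):
        if name.startswith("left "):
            return "right " + name[len("left "):]
        if name.startswith("right "):
            return "left " + name[len("right "):]
        return name  # center joints keep their label
    return {flip(name): xy for name, xy in pose.items()}

def chirality_transform(pose):
    return swap_left_right(reflect(pose))

pose = {"left wrist": (0.3, 0.9), "right wrist": (-0.2, 0.8), "pelvis": (0.0, 0.0)}
flipped = chirality_transform(pose)
# The order of the two operations is interchangeable.
assert flipped == reflect(swap_left_right(pose))
assert flipped["right wrist"] == (-0.3, 0.9)
```

Applying the transform twice recovers the original pose, i.e., the transform is its own inverse.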
In particular, if confronted with either of the poses illustrated on the left or right hand side of Fig. 1, the output of a deep net should be equivariant to the transformation, i.e., the output is also transformed in a "predefined way." For example, if the network's output is also a human pose, the output pose should follow the same transformation. On the other hand, for an activity recognition task, the output probability should remain unchanged. The equivariant map, for pose estimation, is illustrated in Fig. 2 and we make the equivariance property more precise later.
To encode this form of equivariance for human pose regression tasks, we propose "chirality nets." Specifically, the output of a chirality net is guaranteed to be equivariant w.r.t. a transformation composed of reflections and label switching.

*Indicates equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Illustration of the chirality transformation. The transformation includes two operations, (1) a reflection of the pose, i.e., a negation of the x-coordinates; and (2) a switch of the left / right joint labeling. The order of the two operations is interchangeable.

To build chirality nets, we develop chirality equivariant versions of commonly used layers. Specifically, we design and prove equivariance for versions of fully connected, convolutional, batch-normalization, dropout, and LSTM/GRU layers and element-wise non-linearities such as tanh or soft-sign. The main common design principle for chirality equivariant layers is odd and even symmetric sharing of model parameters.
Hence, in addition to being equivariant, transforming a typical deep net into its chiral counterpart results in a reduction of the number of trainable parameters and lower computational complexity due to the symmetry in the model weights. We find that a smaller number of trainable parameters reduces the sample complexity, i.e., the models need less training data.
We demonstrate the generalization and effectiveness of our approach on three pose regression tasks over four datasets: 3D pose estimation on the Human3.6M [22] and HumanEva [49] datasets, 2D pose forecasting on the Penn Action dataset [64] and skeleton-based action recognition on the Kinetics-400 dataset [23]. Our approach achieves state-of-the-art results with guarantees on equivariance, a lower number of parameters, and robustness in low-resource settings.

2 Related Work

First we briefly review invariance and equivariance in machine learning and computer vision as well as human pose regression tasks.
Invariant and equivariant representation. Hand-crafted invariant and equivariant representations have been utilized widely in computer vision systems for decades, e.g., scale invariance of SIFT [32], orientation invariance of HOG [9], affine invariance of the Harris detector [36], shift-invariant systems in image processing [54], etc.
These properties have also been adapted to learned representations. A widely known property is the translation equivariance of convolutional neural nets (CNN) [28]: through spatial or temporal parameter sharing, a shifted input leads to a shifted output. Group-equivariant CNNs extend the equivariance to rotation, mirror reflection and translation [7] by replacing the shift operation with a more general set of transformations.
Other representations for building equivariance into deep nets have also been proposed, e.g., the Symmetric Network [12], the Harmonic Network [57] and the Spherical CNN [8].
The aforementioned works focus on deep nets where the inputs are images. While related, they are not directly applicable to human pose. For example, a reflection with respect to the y-axis in the image domain corresponds to a permutation of the pixel locations, i.e., swapping the pixel intensities between each pixel and its reflected counterpart. In contrast, for human pose, where the input is a vector representing the human joints' spatial coordinates, a reflection corresponds to the negation of the value for each of the joints' reflected dimensions.
The input representation of deep nets for human pose is more similar to pointsets. Prior work has explored building permutation equivariant deep nets, i.e., any permutation of input elements results in the same permutation of output elements [62, 43]. Both works utilize parameter sharing to achieve permutation equivariance. Following these works, graph nets generalize the family of permutation equivariant networks and demonstrate success on numerous applications [46, 27, 14, 13, 1, 26, 61, 31].
For human pose, equivariance to all permutations is too strong a property. Recall, our aim is to build models equivariant to the chiral symmetry, which only involves a specific permutation, e.g., the switch between left and right joints, shown in step (2) of Fig. 1.

Figure 2: Illustration of chirality equivariance for the task of 2D to 3D pose estimation.

Most relevant to our approach is the work by Ravanbakhsh et al. [44], who explore which types of equivariance can be achieved through parameter sharing. Their approach captures one specific permutation in the pose symmetric transform, but does not capture the negation from the reflection, shown in Fig. 1 step (1).
In contrast, our approach considers both operations (1) and (2) jointly, which leads to a different formulation. Lastly, to the best of our knowledge, [44] only discusses the construction of equivariant networks theoretically. In this work, we design and implement a variety of building blocks for deep nets and demonstrate the benefits on a wide range of practical applications in human pose regression tasks.
Human pose applications. For 3D pose estimation from images, recent approaches utilize a two-step approach: (1) 2D pose keypoints are predicted given a video; (2) 3D keypoints are estimated given 2D joint locations. The 2D to 3D estimation is formulated as a regression task via deep nets [40, 52, 35, 51, 10, 41, 59, 33, 17, 29, 42]. Capturing the temporal information is crucial and has been explored in 3D pose estimation [17, 29] as well as in action recognition [53, 20], video segmentation [18, 19] and learning object dynamics [34, 37]. Most recently, Pavllo et al. [42] propose to use temporal convolutions to better capture the temporal information for 3D pose estimation over previous RNN based methods. They also performed train and test time augmentation based on the chiral-symmetric transformation. For test time augmentation, they compute the output for both the original input and the transformed input, and use the average of the two outputs as the final prediction. Note that, unlike our approach, Pavllo et al. [42] need to transform the output of the transformed input back to the original pose. To carefully assess the benefits of chirality nets, in this work, we closely follow the experimental setup of Pavllo et al. [42].
For 2D keypoint forecasting, we follow the setup of standard temporal modeling: conditioning on past observations to predict the future. To improve temporal modeling, recent works have utilized different sequence-to-sequence models for this task [34, 3, 5].
In this work, we closely follow the experimental setup of Chiu et al. [5].
For action recognition, skeleton based methods have been explored extensively [58, 63, 30, 48] due to their robustness to illumination changes and cluttered backgrounds. Here we closely follow the experimental setup of Yan et al. [58].

3 Chirality Nets

In the following we first provide the problem formulation for human pose regression, before defining chirality nets, equivariance and the chirality transform. Subsequently we discuss how to develop typical layers such as the fully connected layer, the convolution, etc., which make up chirality nets. The PyTorch implementation and unit-tests of the proposed layers are part of the supplementary material. We have also included a short Jupyter notebook demo to illustrate the key concepts.

3.1 Problem Formulation

Chirality nets can be applied to regression tasks on coordinates of joints for human pose related tasks, i.e., the input corresponds to 2D or 3D coordinates of human joints. For readability, we introduce the input and output representations for a single frame. Note that for our experiments we generalize chirality nets to multiple frames by introducing a time dimension.
We let x ∈ R^(|J^in|·|D^in|) denote the chirality net input, where J^in is the set of all joints and D^in is the dimension index set for an input coordinate. For example, J^in = {'right wrist', 'right shoulder', . . .} and D^in = {0, 1} for 2D input joint coordinates. Similarly, we let y ∈ R^(|J^out|·|D^out|) refer to the chirality net output. Note that the dimension of the spatial coordinates at the input and output may be different, e.g., prediction from 2D to 3D.
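As a toy illustration of this input representation (the joint names and coordinate values are made up for the example):

```python
# Illustrative sketch of the input representation: 2D joint coordinates
# stacked into a vector x of length |J_in| * |D_in|.  Joint names and
# values are made up for the example.
J_in = ["left wrist", "left elbow",    # J_l: left joints
        "right wrist", "right elbow",  # J_r: right joints, in matching order
        "pelvis"]                      # J_c: center joints
D_in = [0, 1]                          # index set for the (x, y) coordinates

keypoints = {"left wrist": (0.3, 0.9), "left elbow": (0.25, 0.7),
             "right wrist": (-0.2, 0.8), "right elbow": (-0.15, 0.6),
             "pelvis": (0.0, 0.0)}

# Stack the per-joint coordinates into one flat input vector.
x = [keypoints[j][d] for j in J_in for d in D_in]
assert len(x) == len(J_in) * len(D_in)  # = 10
```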
Also, the number of joints may differ, e.g., when mapping between different key-point sets.
For human pose regression, the task is to learn the parameters θ of a model F_θ by minimizing a loss function, L(θ) = Σ_{(x,y)∈D} ℓ(F_θ(x), y), over the training dataset D. Hereby, the sample loss ℓ(F_θ(x), y) compares the prediction F_θ(x) to the ground-truth y.

3.2 Chirality Nets, Chirality Equivariance, and Chirality Transforms

Chirality nets exhibit chirality equivariance, i.e., their output is transformed in a "predefined manner" given that the chirality transform is applied at the input. Note that the input and output dimensions D^in and D^out may differ. To define this chirality equivariance, we hence need to consider a pair of transformations, one for the input data, T^in, and one for the output data, T^out. The corresponding equivariance map is illustrated in Fig. 2 for the task of 2D to 3D pose estimation. Formally, we say a function F_θ is chirality equivariant w.r.t. (T^in, T^out) if

T^out(F_θ(x)) = F_θ(T^in(x))   ∀x ∈ R^(|J^in|·|D^in|).

To define the chirality transform on the input data, i.e., T^in, we split the set of joints J^in into ordered tuples J^in_l, J^in_r, and J^in_c, each denoting left, right and center joints of the input. Importantly, these tuples are sorted such that the corresponding left/right joints are at corresponding positions in the tuple. We also split the dimension index set D^in into D^in_n and D^in_p := D^in \ D^in_n, indicating the coordinates to, or not to, negate.
For readability and without loss of generality, assume the dimensions of the input x follow the order J^in_l, J^in_r, J^in_c, i.e., x = [x_l, x_r, x_c]. Within each vector x_(·), we place the coordinates in the set D^in_n before the remaining ones, i.e., x_l = [x_ln, x_lp].
Given this construction of the input x, the reflection illustrated in step (1) of Fig.
1 is a matrix multiplication with a (|J^in|·|D^in|) × (|J^in|·|D^in|) diagonal matrix T^in_neg, defined as follows:

T^in_neg = diag([-1_(|J^in_l|·|D^in_n|), 1_(|J^in_l|·|D^in_p|), -1_(|J^in_r|·|D^in_n|), 1_(|J^in_r|·|D^in_p|), -1_(|J^in_c|·|D^in_n|), 1_(|J^in_c|·|D^in_p|)]),

where 1_K indicates a vector of ones of length K. The switch operation illustrated in step (2) of Fig. 1 is a matrix multiplication with a permutation matrix of dimension (|J^in|·|D^in|) × (|J^in|·|D^in|), defined as follows:

T^in_swi = [ 0                     I_(|J^in_l|·|D^in|)   0
             I_(|J^in_l|·|D^in|)   0                     0
             0                     0                     I_(|J^in_c|·|D^in|) ],

where I_K denotes an identity matrix of size K × K.
Given those matrices, the chirality transform of the input T^in(x) is obtained via T^in(x) = T^in_neg T^in_swi x. The chirality transform of the output, T^out, is defined similarly, replacing "in" with "out".
In the following, we introduce layers that satisfy the (T^in, T^out) chirality equivariance property. This enables constructing a chirality net F_θ, as the composition of equivariant layers remains equivariant. Note that (T^in, T^out) chirality equivariance can be specified separately for every deep net layer, which provides additional flexibility. In the following we discuss how to construct layers which satisfy chirality equivariance.

3.3 Chirality Layers

Fully connected layer.
A fully connected layer performs the mapping y = f_FC(x; W, b) := Wx + b. We achieve equivariance through parameter sharing and odd symmetry:

W = [  W_{ln,ln}   W_{ln,lp} |  W_{ln,rn}   W_{ln,rp} |  W_{ln,cn}   W_{ln,cp}
       W_{lp,ln}   W_{lp,lp} |  W_{lp,rn}   W_{lp,rp} |  W_{lp,cn}   W_{lp,cp}
       ------------------------------------------------------------------------
       W_{ln,rn}  -W_{ln,rp} |  W_{ln,ln}  -W_{ln,lp} |  W_{ln,cn}  -W_{ln,cp}
      -W_{lp,rn}   W_{lp,rp} | -W_{lp,ln}   W_{lp,lp} | -W_{lp,cn}   W_{lp,cp}
       ------------------------------------------------------------------------
       W_{cn,ln}   W_{cn,lp} |  W_{cn,ln}  -W_{cn,lp} |  W_{cn,cn}   0
       0           W_{cp,lp} |  0           W_{cp,lp} |  0           W_{cp,cp} ],

b = [ b_{ln}, b_{lp}, -b_{ln}, b_{lp}, 0, b_{cp} ]^T.

Shared parameters are denoted by identical symbols (color coded in the original figure). Each W_{(·),(·)} denotes a matrix, where the first and the second subscript characterize the dimensions of the output and the input. For example, W_{ln,rp} computes the output's left (l) joints' negated (n) dimensions from the input's right (r) joints' non-negated, i.e., positive (p), dimensions. Note that W_{ln,rp} is a matrix of dimension |J^out_l|·|D^out_n| × |J^in_r|·|D^in_p|. We refer to this layer as the chiral fully connected layer.
1D convolution layers [55, 28]. Pose symmetric 1D convolution layers can be based on fully connected layers. A 1D convolution is a fully connected layer with shared parameters across the time dimension, i.e., at each time step the computation is the sum of fully connected layers over a window:

y_t = Σ_τ W_τ x_{t-τ} + b = Σ_τ f_FC(x_{t-τ}; W_τ, b).

Consequently, we enforce equivariance at each time step by employing the symmetry pattern of fully connected layers at each time slice.
Element-wise nonlinearities. Nonlinearities are applied element-wise and do not contain parameters. These operations maintain the input dimension, therefore, T^out and T^in are identical. A nonlinearity f that is an odd function, i.e., f(-x) = -f(x), such as tanh, hardtanh, or soft-sign satisfies the equivariance property. See the following proof:

T^out(f(x)) = T^out_neg T^out_swi f(x)
            = T^out_neg f(T^out_swi x)      (element-wise f)
            = f(T^out_neg T^out_swi x)      (odd function f)
            = f(T^in(x))   ∀x ∈ R^(|J^in|·|D^in|).

LSTM and GRU layers [16, 6]. LSTM and GRU modules which satisfy chirality can be obtained from fully connected layers. However, naively setting all matrix multiplies within an LSTM to satisfy the equivariance property will not lead to an equivariant LSTM because gates are elementwise multiplied with the cell state. If both gate and cell preserve the negation then the product will not. Therefore, we change the weight sharing scheme for the gates. We set D^out_n for the gates to be the empty set, i.e., the gates will be invariant to negation at the input, T^in_neg, but still equivariant to the switch operation, T^in_swi. With this setup, the product of the gates and the cell's output will preserve the sign, as the gates are invariant to negation and passed through a sigmoid to be within the range of (0, 1). GRU modules are modified in the same manner.
Batch-normalization [21].
A batch normalization layer performs an element-wise standardization, followed by an element-wise affine layer (with learnable parameters γ and β). For γ and β, we follow the principle applied to fully connected layers.
Equivariance for µ and σ is obtained by computing the mean and standard deviation on the "augmented batch" and by keeping track of its running average.
Dropout [50]. At test time, dropout scales the input by p, where p is the dropout probability. The equivariance property is satisfied because of the associativity of scalar multiplication.

3.4 Reduction in model parameters, FLOPS, and training/test details

Model parameters. Our model shares parameters between dimensions representing the left and right joints. For each layer, the number of parameters is reduced by a factor of

((|J^in_l| + |J^in_c|) · (|J^out_l| + |J^out_c|)) / (|J^in| · |J^out|).

Recall |J^in| = |J^in_l| + |J^in_r| + |J^in_c|; the output dimension size is computed similarly.

Figure 3: Illustration of pose regression tasks: (a) 2D to 3D pose estimation; (b) 2D pose forecasting; and (c) skeleton-based action recognition.

Table 1: Results on the Human3.6M dataset: reconstruction error using Protocol 1 (MPJPE) in mm, per action (Dir., Disc., Eat, Greet, Phone, Photo, Pose, Purch., Sit, SitD., Smoke, Wait, WalkD., Walk, WalkT.) and on average, comparing Pavlakos [41] (CVPR'18), Yang [59] (CVPR'18), Luvizon [33] (CVPR'18), Hossain [17] (ECCV'18), Lee [29] (ECCV'18), Pavllo [42] (CVPR'19) and our models. The best result is boldface and the second best is underlined. † indicates temporal models, ♦ uses ground-truth bounding box, and ‡ indicates test-time augmentation. [Per-action values not reproduced here: the table's numeric columns were scrambled in extraction.]

FLOPS. Chirality nets also have lower FLOPS. Due to the symmetry, instead of multiplying and adding each of the elements independently, we add the symmetric values first before applying a single multiplication per symmetric pair. Concretely, consider w = [w1, w1], x = [x1, x2], and their inner product w^T x. Instead of computing w1 · x1 + w1 · x2, we exploit symmetry and use instead w1 · (x1 + x2), which removes one multiplication operation. This is a common speed up trick used in symmetric FIR filters [38, 60].
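As a toy numerical check of the chiral weight sharing and the equivariance it is meant to guarantee, consider one left and one right joint, each with one negated and one non-negated coordinate; the weight values are arbitrary and the code is an illustrative sketch, not the paper's implementation:

```python
import math
import random

# Chiral fully connected layer for a toy pose with one left and one right
# joint, each with one negated (n) and one non-negated (p) coordinate,
# i.e., x = [x_ln, x_lp, x_rn, x_rp].  Weight values are illustrative.

def T(v):
    """Chirality transform: swap left/right blocks, negate the n dims."""
    ln, lp, rn, rp = v
    return [-rn, rp, -ln, lp]

# Free parameters of the layer: 8 weights + 2 biases instead of 16 + 4.
a, b_, c, d = 0.5, -1.2, 0.7, 0.3   # W_ln,ln  W_ln,lp  W_ln,rn  W_ln,rp
e, f, g, h = -0.4, 0.9, 0.2, -0.8   # W_lp,ln  W_lp,lp  W_lp,rn  W_lp,rp
b1, b2 = 0.1, -0.6                  # b_ln, b_lp

W = [[ a,  b_,  c,  d ],            # output ln
     [ e,  f,   g,  h ],            # output lp
     [ c, -d,   a, -b_],            # output rn: parameters shared with row ln
     [-g,  h,  -e,  f ]]            # output rp: parameters shared with row lp
b = [b1, b2, -b1, b2]               # odd/even symmetric bias

def layer(x):
    pre = [sum(W[i][j] * x[j] for j in range(4)) + b[i] for i in range(4)]
    return [math.tanh(p) for p in pre]  # tanh is odd, preserving equivariance

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(4)]
lhs = T(layer(x))   # transform the output
rhs = layer(T(x))   # transform the input
assert all(abs(u - v) < 1e-12 for u, v in zip(lhs, rhs))
```

The rows for the right joint reuse the parameters of the left-joint rows, which is where the parameter and multiplication savings discussed in this section come from.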
The number of multiplications reduces by a factor of (|J^in_l| + |J^in_c|) / |J^in|.
Additionally, baseline models utilize test-time augmentation, which requires two forward passes through the network for each input, whereas the proposed nets only use a single forward pass.
Training and test details. During training it is important to apply the chirality transform for data-augmentation, i.e., with 50% probability we apply T^in and T^out to input and label. This ensures that the mini-batch statistics match our assumption on the chirality, i.e., poses that form a chiral pair are both valid, which is important for the batch-normalization layer. Moreover, during training we use a standard dropout layer. While we could impose dropped units to be chiral equivariant, we found this led to over-fitting in practice. This is expected, as imposing chirality on the added noise reduces the randomness. Importantly, during test no data-augmentation is performed and a single forward pass is sufficient to obtain an 'averaged' result.

4 Experiments

We evaluate our approach on a variety of tasks, including 2D to 3D pose estimation, 2D pose forecasting, and skeleton based action recognition. For each task, we describe the dataset, metric, and implementation before discussing the results.

4.1 2D to 3D pose estimation

Task. 3D human pose estimation can be decoupled into the tasks of 2D keypoint detection and 2D to 3D pose estimation. We focus on the latter task, i.e., given a sequence of 2D keypoints, the task is to estimate the corresponding 3D human pose. See Fig. 3 (a) for an illustration.
Dataset and metric. We evaluate on two standard datasets, Human3.6M [22] and HumanEva-I [49]. Human3.6M is a large scale dataset of human motion with 3.6 million video frames. The dataset consists of 11 subjects performing 15 different actions.
Following prior work [40, 52, 35, 51, 33, 42], each human pose is represented by a 17-joint skeleton. We use the same train and test subject splits. HumanEva-I is a smaller dataset consisting of four subjects and six actions. To be consistent with prior work [41, 29, 42], we use the same train and test splits evaluated over the actions of walk, jog, and box. For both of these datasets, we consider the setting where we train one model for all actions.
We report the two standard metrics used in prior work: Protocol 1 (MPJPE), which is the mean per-joint position error between the prediction and ground-truth [35, 40, 42], and Protocol 2 (P-MPJPE), which is the error, after alignment, between the prediction and ground-truth [35, 51, 17, 42].

Table 2: Results on HumanEva-I for multi-action (MA) models reported in Protocol 2 (P-MPJPE), lower the better. ‡ indicates test time augmentation.

App.             | Walk S1  S2    S3   | Jog S1  S2    S3   | Box S1  S2    S3   | Avg.
Pavlakos [40]    | 22.3     19.5  29.7 | 28.9    21.9  23.8 | --      --    --   | --
Pavlakos [41]    | 18.8     12.7  29.2 | 23.5    15.4  14.5 | --      --    --   | --
Lee [29]         | 18.6     19.9  30.5 | 25.7    16.8  17.7 | 42.8    48.1  53.4 | --
Pavllo [42]      | 14.1     10.4  46.8 | 21.1    13.3  14.0 | 23.8    34.5  32.3 | 31.1
Pavllo [42] (‡)  | 13.9     10.2  46.6 | 20.9    13.1  13.8 | 23.8    33.7  32.0 | 30.8
Ours             | 15.2     10.3  47.0 | 21.8    13.1  13.7 | 22.8    31.8  31.0 | 30.6

Figure 4: Comparisons between our approach and [42] in limited data settings evaluated using Protocol 1 on Human3.6M.

Table 3: Results on the Penn Action dataset, performance reported in terms of PCK@0.05 (higher the better) for prediction steps 1-16 and their average. (‡) indicates using test time augmentation.

Approach                | 1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   | Avg.
Residual [34] (CVPR'17) | 82.4 68.3 58.5 50.9 44.7 40.0 36.4 33.4 31.3 29.5 28.3 27.3 26.4 25.7 25.0 24.5 | 39.5
3D-PFNet [3] (CVPR'17)  | 79.2 60.0 49.0 43.9 41.5 40.3 39.8 39.7 40.1 40.5 41.1 41.6 42.3 42.9 43.2 43.3 | 45.5
TP-RNN [5] (WACV'19)    | 84.5 72.0 64.8 60.3 57.2 55.0 53.4 52.1 50.9 50.0 49.3 48.7 48.3 47.9 47.6 47.3 | 55.6
Baseline w/o aug.       | 87.3 75.7 68.5 64.0 61.0 59.1 57.6 56.3 55.4 54.9 54.5 54.5 54.4 54.5 54.6 54.7 | 60.4
Baseline w/ aug.        | 86.9 75.2 67.9 63.5 60.4 58.4 57.0 55.8 55.1 54.5 54.1 54.0 53.9 53.9 54.0 54.0 | 59.9
Baseline w/ aug. (‡)    | 87.0 75.5 68.4 64.1 61.0 59.1 57.5 56.3 55.5 55.0 54.7 54.7 54.6 54.7 54.7 54.7 | 60.5
Ours                    | 87.5 77.0 68.7 64.2 61.2 59.2 57.6 56.5 55.7 55.1 54.7 54.6 54.4 54.5 54.5 54.5 | 60.6

Implementation details. Our model follows the supervised training procedure and network design of Pavllo et al. [42]. Our network is the identical temporal convolutional network architecture, where each layer is replaced with its chiral version, i.e., 1D dilated convolution, batch-normalization, and dropout layers. We also replace ReLU non-linearities with tanh to achieve equivariance. No additional architecture changes were made. For Human3.6M, we use 2D keypoints extracted from CPN [4] with Mask R-CNN [15] bounding boxes released by Pavllo et al. [42]. For HumanEva-I, we use the 2D keypoint detections from Mask R-CNN released by Pavllo et al. [42].
Results. In Tab.
1, we report the performance on the Human3.6M data using Protocol 1 (MPJPE). Our approach outperforms the state-of-the-art [42], which uses test-time augmentation, by 0.1 mm in overall average and achieves the best results in eight out of fifteen sub-categories. For the single-frame models, we observe a more significant reduction in error of 0.4 mm over [42] with test time augmentation. Additionally, when comparing without test-time augmentation, our approach outperforms by 1 mm. We note that test-time augmentation employed by Pavllo et al. [42] involves running the network twice for each input. In contrast, our approach only requires a single forward pass.
Next, on the HumanEva-I dataset, we also observed an increase in performance using Protocol 1. On average, our approach achieves a 32.2 mm error. This is a 0.8 mm decrease over the current state-of-the-art of 33.0 mm [42] and a 1.1 mm decrease over [42] without test-time augmentation, at 33.3 mm.
We also performed evaluation using Protocol 2 (P-MPJPE). On Human3.6M we observe that our approach performs worse than Pavllo et al. [42] by 0.3 mm. We note that the loss function is chosen to optimize Protocol 1; therefore our models are performing better at what they are optimized for. In Tab. 2, we report the performance on HumanEva-I using Protocol 2 (P-MPJPE). Our model achieves a 0.2 mm reduction in error over Pavllo et al. [42] on average. Most of the gain is obtained for the boxing action, possibly due to the symmetric nature of the movement.
Limited data settings. A benefit of fewer model parameters is the potential to obtain better models with less data. To confirm this, we perform experiments by varying the amount of training data, starting from 0.1% of subject 1 (S1) to using three subjects S1, S5, S6. The results with comparison to [42] are shown in Fig. 4. We observe that our approach consistently outperforms [42] in these low-resource settings, except at S1 0.1%.
For the reported numbers, we use a batch-size of 64, and all other hyper-parameters are identical between the models. If we further decrease the batch-size to 32 for S1 0.1%, our approach improves to 100.4 mm whereas [42] improves to 102.3 mm.

4.2 2D pose forecasting

Task. 2D pose forecasting is the pose regression task of predicting the future human pose, represented in 2D keypoints, given present and past human pose. See Fig. 3 (b) for an illustration.

Figure 4 (data): MPJPE in mm across training splits S1 0.1%, S1 1%, S1 5%, S1 10%, S1 50%, S1 100%, S1+S5, S1+S5+S6.
Pavllo et al. [42]: 105.6, 99.5, 93.7, 85.4, 71.9, 67.1, 65.1, 59.9.
Ours: 108.9, 93.0, 85.0, 77.8, 68.2, 63.9, 62.0, 56.6.

Dataset and metric. We evaluate on the Penn Action dataset [64]. The dataset consists of 2236 videos with 15 actions. Each frame is annotated with 2D keypoints of 13 human joints. We use the same train and test split as in [3, 5]. Following Chiu et al. [5] we consider initial velocity as being part of the input and a single model is used for all actions. For a fair comparison with prior work, we report the 'Percentage of Correct Keypoints' metric with a 0.05 threshold (PCK@0.05), which assesses the accuracy of the predicted keypoints. A predicted keypoint is considered correct if it is within a 0.05 radius of the ground-truth when considering normalized distance.
Implementation details. Our non-chiral equivariant baseline model is a sequence-to-sequence model based on [34]. We made several modifications to match the hyperparameters in [5], i.e., we used StackedRNN [39] with 2 layers and added dropout layers. Additionally, we utilize teacher forcing [56] during training, while prior work did not. We find this to stabilize training and enable the use of the Adam [25, 45] optimizer without diverging. We performed data augmentation via the chirality transform, i.e., with 0.5 probability we apply T^in and T^out to the input and the ground-truth correspondingly.
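A minimal sketch of this 50%-probability augmentation, using a toy flat-vector encoding in the spirit of Sec. 3.2; the helper names and the encoding are illustrative, not the released implementation:

```python
import random

# Sketch of the chirality data augmentation used during training: with
# probability p, apply T_in to the input pose and T_out to the label.
# Toy encoding: [x_ln, x_lp, x_rn, x_rp], one left/right joint with one
# negated (n) and one non-negated (p) dimension.

def chirality_transform(v):
    """Swap the left/right blocks and negate the to-be-negated dims."""
    ln, lp, rn, rp = v
    return [-rn, rp, -ln, lp]

def augment(x, y, p=0.5, rng=random):
    """With probability p, transform the input and the label consistently."""
    if rng.random() < p:
        return chirality_transform(x), chirality_transform(y)
    return x, y

x, y = [0.3, 0.9, -0.2, 0.8], [1.0, 2.0, 3.0, 4.0]
tx, ty = augment(x, y, p=1.0)   # force the transformed branch
assert tx == [0.2, 0.8, -0.3, 0.9] and ty == [-3.0, 4.0, -1.0, 2.0]
```

Input and label are always transformed together, so every augmented sample remains a valid (input, ground-truth) pair.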
For our pose-symmetric model, we replaced all non-symmetric layers, e.g., fully connected layers and LSTM cells, with their corresponding chiral versions.
Results. In Tab. 3, we report the performance of our models and the state-of-the-art. The baseline model without augmentation outperforms the state-of-the-art [5]. The gain comes from the use of the Stacked-LSTM and teacher forcing during training. With additional train- and test-time data augmentation, our baseline model improves further. In addition, our pose-symmetric model outperforms the baseline in terms of average PCK@0.05. We observe more significant improvements for the first ten prediction steps.

4.3 Skeleton based action recognition

Task. Skeleton based action recognition aims at predicting the human action based on skeleton sequences. See Fig. 3 (c) for an illustration.
Dataset and metric. We use the Kinetics-400 dataset [23] in our experiments. The dataset contains 400 action classes and 306,245 clips in total. Following the experimental setup of [58], we use OpenPose [2] to locate the 18 human body joints. Each joint is represented as (x, y, c), where x and y are the 2D coordinates of the joint and c is the confidence score of the joint given by OpenPose. Following [23], we report the classification accuracy at top-1 and top-5.

Approach               Top-1   Top-5
Feature Encoding [11]  14.9%   25.8%
Deep LSTM [47]         16.4%   35.3%
Temporal-Conv [24]     20.3%   40.0%
ST-GCN [58]            30.7%   52.8%
Ours-Conv              30.8%   52.6%
Ours-Conv-Chiral       30.9%   53.0%

Table 4: Results of the skeleton based action recognition baselines on the Kinetics-400 dataset [23], reported in Top-1 and Top-5 accuracy.

Implementation details. Our baseline model, 'Ours-Conv,' follows 'Temporal-Conv' [24], modified to have not only temporal convolution but also spatial convolution. 
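The "corresponding chiral versions" of layers mentioned above rest on block parameter sharing. Below is a minimal numpy sketch for a fully connected layer acting on a pose with only left/right joint features; it ignores center joints and sign-negated dimensions, which the paper's full construction also handles, so the block structure here is a simplified assumption:

```python
import numpy as np

def chiral_linear(x_left, x_right, A, B):
    """Weight-shared linear map, i.e., [[A, B], [B, A]] @ [x_left; x_right]."""
    return A @ x_left + B @ x_right, B @ x_left + A @ x_right

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
x_l, x_r = rng.normal(size=4), rng.normal(size=4)

# Equivariance check: swapping the left/right inputs swaps the left/right outputs.
y_l, y_r = chiral_linear(x_l, x_r, A, B)
z_l, z_r = chiral_linear(x_r, x_l, A, B)   # chirality-transformed input
assert np.allclose(z_l, y_r) and np.allclose(z_r, y_l)
```

This equivariance is what allows a single forward pass to stand in for test-time flip augmentation, and the shared blocks A and B halve the number of trainable weights relative to an unconstrained matrix of the same size.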
The spatial convolution considers the intra-frame information, while the temporal convolution considers the inter-frame information.
For the recognition task, we need chiral invariance, i.e., a chiral pair should be classified as the same action class. To this end, we use a chiral invariance layer in which we let both J_l^out and J_r^out as well as D_n^out be empty sets, meaning there are no left and right joints but only center joints, and no dimension of the layer's output is negated when applying the chirality transform. Note that the chirality transform exchanges the left and right joints and negates the dimensions in the dimension index set D_n^out. Given that J_l^out, J_r^out and D_n^out are all empty, the output is trivially chiral invariant. For the chiral invariance model, 'Ours-Conv-Chiral,' we replace all the non-symmetric layers before the chiral invariance layer with their corresponding chiral equivariant versions. All the layers after the chiral invariance layer remain identical to the 'Ours-Conv' model. There are in total 10 layers of spatial and temporal convolution, and we place the chiral invariance layer at the fourth layer. We use the SGD optimizer with a momentum of 0.9, as in [58].
Results. In Tab. 4, we report the action recognition performance of our model and the skeleton-based approaches. We observe that the baseline model 'Ours-Conv' performs on par with ST-GCN [58], and the chiral invariant model 'Ours-Conv-Chiral' outperforms both ST-GCN and Ours-Conv in Top-1 and Top-5 accuracy, achieving state-of-the-art performance on the Kinetics-400 dataset among skeleton based action recognition methods.

5 Conclusion

We introduce chirality equivariance for pose regression tasks and develop deep net layers that satisfy this property. 
Through parameter sharing and odd/even symmetry, we design equivariant versions of commonly used layers in deep nets, including fully connected, 1D convolution, LSTM/GRU cells, and batch normalization layers. With these equivariant layers at hand, we build Chirality Nets, which guarantee equivariance from the input to the output. Our models naturally lead to a reduction in trainable parameters and computation due to symmetry. Our experimental results on three human pose regression tasks over four datasets demonstrate state-of-the-art performance and the wide practical impact of the proposed layers.

Acknowledgments: This work is supported in part by NSF under Grant No. 1718221 and MRI #1725729, UIUC, Samsung, 3M, Cisco Systems Inc. (Gift Award CG 1377144) and Adobe. We thank NVIDIA for providing GPUs used for this work and Cisco for access to the Arcetri cluster. RY is supported by a Google PhD Fellowship.

References
[1] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[2] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008, 2018.
[3] Y.-W. Chao, J. Yang, B. Price, S. Cohen, and J. Deng. Forecasting human dynamics from static images. In Proc. CVPR, 2017.
[4] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In Proc. CVPR, 2018.
[5] H.-k. Chiu, E. Adeli, B. Wang, D.-A. Huang, and J. C. Niebles. Action-agnostic human pose forecasting. In Proc. WACV, 2019.
[6] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 
Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. EMNLP, 2014.
[7] T. Cohen and M. Welling. Group equivariant convolutional networks. In Proc. ICML, 2016.
[8] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. In Proc. ICLR, 2018.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, 2005.
[10] H.-S. Fang, Y. Xu, W. Wang, X. Liu, and S.-C. Zhu. Learning pose grammar to encode human body configuration for 3D pose estimation. In Proc. AAAI, 2018.
[11] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In Proc. CVPR, 2015.
[12] R. Gens and P. M. Domingos. Deep symmetry networks. In Proc. NeurIPS, 2014.
[13] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proc. ICML, 2017.
[14] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proc. ICCV, 2017.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
[17] M. R. I. Hossain and J. J. Little. Exploiting temporal information for 3D human pose estimation. In Proc. ECCV, 2018.
[18] Y.-T. Hu, J.-B. Huang, and A. G. Schwing. MaskRNN: Instance Level Video Object Segmentation. In Proc. NeurIPS, 2017.
[19] Y.-T. Hu, J.-B. Huang, and A. G. Schwing. VideoMatch: Matching based Video Object Segmentation. In Proc. ECCV, 2018.
[20] N. Hussein, E. Gavves, and A. W. Smeulders. Timeception for complex action recognition. In Proc. CVPR, 2019.
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, 2015.
[22] C. 
Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI, 2014.
[23] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[24] T. S. Kim and A. Reiter. Interpretable 3D human action analysis with temporal convolutional networks. In Proc. CVPRW, 2017.
[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.
[26] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. In Proc. ICML, 2018.
[27] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proc. ICLR, 2017.
[28] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision. 1999.
[29] K. Lee, I. Lee, and S. Lee. Propagating LSTM: 3D pose estimation based on joint interdependency. In Proc. ECCV, 2018.
[30] C. Li, Q. Zhong, D. Xie, and S. Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In Proc. IJCAI, 2018.
[31] I.-J. Liu*, R. A. Yeh*, and A. G. Schwing. PIC: Permutation invariant critic for multi-agent deep reinforcement learning. In Proc. CORL, 2019. * equal contribution.
[32] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, 1999.
[33] D. C. Luvizon, D. Picard, and H. Tabia. 2D/3D pose estimation and action recognition using multitask deep learning. In Proc. CVPR, 2018.
[34] J. Martinez, M. J. Black, and J. Romero. On human motion prediction using recurrent neural networks. In Proc. CVPR, 2017.
[35] J. Martinez, R. Hossain, J. Romero, and J. J. Little. 
A simple yet effective baseline for 3D human pose estimation. In Proc. ICCV, 2017.
[36] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. IJCV, 2004.
[37] M. Minderer, C. Sun, R. Villegas, F. Cole, K. Murphy, and H. Lee. Unsupervised learning of object structure and dynamics from videos. In Proc. NeurIPS, 2019.
[38] Altera Application Note. Implementing FIR filters in FLEX devices. Altera Corporation, Feb. 1998. URL http://www.ee.ic.ac.uk/pcheung/teaching/ee3_dsd/fir.pdf.
[39] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. In Proc. ICLR, 2014.
[40] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proc. CVPR, 2017.
[41] G. Pavlakos, X. Zhou, and K. Daniilidis. Ordinal depth supervision for 3D human pose estimation. In Proc. CVPR, 2018.
[42] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proc. CVPR, 2019.
[43] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. CVPR, 2017.
[44] S. Ravanbakhsh, J. Schneider, and B. Poczos. Equivariance through parameter-sharing. In Proc. ICML, 2017.
[45] S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In Proc. ICLR, 2018.
[46] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Trans. Neural Netw., 2009.
[47] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proc. CVPR, 2016.
[48] C. Si, Y. Jing, W. Wang, L. Wang, and T. Tan. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proc. ECCV, 2018.
[49] L. Sigal, A. O. 
Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 2010.
[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
[51] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. In Proc. ICCV, 2017.
[52] B. Tekin, P. Márquez-Neila, M. Salzmann, and P. Fua. Learning to fuse 2D and 3D image cues for monocular body pose estimation. In Proc. ICCV, 2017.
[53] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proc. CVPR, 2018.
[54] M. Vetterli, J. Kovačević, and V. K. Goyal. Foundations of signal processing. Cambridge University Press, 2014.
[55] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition using time-delay neural networks. Backpropagation: Theory, Architectures and Applications, 1995.
[56] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1989.
[57] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proc. CVPR, 2017.
[58] S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proc. AAAI, 2018.
[59] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. 3D human pose estimation in the wild by adversarial learning. In Proc. CVPR, 2018.
[60] R. A. Yeh, M. Hasegawa-Johnson, and M. N. Do. Stable and symmetric filter convolutional neural network. In Proc. ICASSP, 2016.
[61] R. A. Yeh, A. G. Schwing, J. Huang, and K. Murphy. Diverse generation for multi-agent sports games. In Proc. CVPR, 2019.
[62] M. Zaheer, S. 
Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Proc. NeurIPS, 2017.
[63] P. Zhang, J. Xue, C. Lan, W. Zeng, Z. Gao, and N. Zheng. Adding attentiveness to the neurons in recurrent neural networks. In Proc. ECCV, 2018.
[64] W. Zhang, M. Zhu, and K. G. Derpanis. From actemes to action: A strongly-supervised representation for detailed action understanding. In Proc. ICCV, 2013.