{"title": "Deep Learning for Precipitation Nowcasting: A Benchmark and A New Model", "book": "Advances in Neural Information Processing Systems", "page_first": 5617, "page_last": 5627, "abstract": "With the goal of making high-resolution forecasts of regional rainfall, precipitation nowcasting has become an important and fundamental technology underlying various public services ranging from rainstorm warnings to flight safety. Recently, the Convolutional LSTM (ConvLSTM) model has been shown to outperform traditional optical flow based methods for precipitation nowcasting, suggesting that deep learning models have a huge potential for solving the problem. However, the convolutional recurrence structure in ConvLSTM-based models is location-invariant while natural motion and transformation (e.g., rotation) are location-variant in general. Furthermore, since deep-learning-based precipitation nowcasting is a newly emerging area, clear evaluation protocols have not yet been established. To address these problems, we propose both a new model and a benchmark for precipitation nowcasting. Specifically, we go beyond ConvLSTM and propose the Trajectory GRU (TrajGRU) model that can actively learn the location-variant structure for recurrent connections. 
Besides, we provide a benchmark that includes a real-world large-scale dataset from the Hong Kong Observatory, a new training loss, and a comprehensive evaluation protocol to facilitate future research and gauge the state of the art.", "full_text": "Deep Learning for Precipitation Nowcasting:\n\nA Benchmark and A New Model\n\nXingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung\n\nDepartment of Computer Science and Engineering\nHong Kong University of Science and Technology\n\n{xshiab,zgaoag,lelausen,hwangaz,dyyeung}@cse.ust.hk\n\nWai-kin Wong, Wang-chun Woo\n\nHong Kong Observatory\n\nHong Kong, China\n\n{wkwong,wcwoo}@hko.gov.hk\n\nAbstract\n\nWith the goal of making high-resolution forecasts of regional rainfall, precipita-\ntion nowcasting has become an important and fundamental technology underlying\nvarious public services ranging from rainstorm warnings to \ufb02ight safety. Recently,\nthe Convolutional LSTM (ConvLSTM) model has been shown to outperform tradi-\ntional optical \ufb02ow based methods for precipitation nowcasting, suggesting that deep\nlearning models have a huge potential for solving the problem. However, the con-\nvolutional recurrence structure in ConvLSTM-based models is location-invariant\nwhile natural motion and transformation (e.g., rotation) are location-variant in gen-\neral. Furthermore, since deep-learning-based precipitation nowcasting is a newly\nemerging area, clear evaluation protocols have not yet been established. To address\nthese problems, we propose both a new model and a benchmark for precipitation\nnowcasting. Speci\ufb01cally, we go beyond ConvLSTM and propose the Trajectory\nGRU (TrajGRU) model that can actively learn the location-variant structure for\nrecurrent connections. 
Besides, we provide a benchmark that includes a real-world\nlarge-scale dataset from the Hong Kong Observatory, a new training loss, and a\ncomprehensive evaluation protocol to facilitate future research and gauge the state\nof the art.\n\n1\n\nIntroduction\n\nPrecipitation nowcasting refers to the problem of providing very short range (e.g., 0-6 hours) forecast\nof the rainfall intensity in a local region based on radar echo maps1, rain gauge and other observation\ndata as well as the Numerical Weather Prediction (NWP) models. It signi\ufb01cantly impacts the daily\nlives of many and plays a vital role in many real-world applications. Among other possibilities,\nit helps to facilitate drivers by predicting road conditions, enhances \ufb02ight safety by providing\nweather guidance for regional aviation, and avoids casualties by issuing citywide rainfall alerts.\nIn addition to the inherent complexities of the atmosphere and relevant dynamical processes, the\never-growing need for real-time, large-scale, and \ufb01ne-grained precipitation nowcasting poses extra\nchallenges to the meteorological community and has aroused research interest in the machine learning\ncommunity [23, 25].\n\n1The radar echo maps are Constant Altitude Plan Position Indicator (CAPPI) images which can be converted\n\nto rainfall intensity maps using the Marshall-Palmer relationship or Z-R relationship [19].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThe conventional approaches to precipitation nowcasting used by existing operational systems rely\non optical \ufb02ow [28]. In a modern day nowcasting system, the convective cloud movements are\n\ufb01rst estimated from the observed radar echo maps by optical \ufb02ow and are then used to predict the\nfuture radar echo maps using semi-Lagrangian advection. 
However, these methods are unsupervised\nfrom the machine learning point of view in that they do not take advantage of the vast amount of\nexisting radar echo data. Recently, progress has been made by utilizing supervised deep learning [15]\ntechniques for precipitation nowcasting. Shi et al. [23] formulated precipitation nowcasting as a\nspatiotemporal sequence forecasting problem and proposed the Convolutional Long Short-Term\nMemory (ConvLSTM) model, which extends the LSTM [7] by having convolutional structures in both\nthe input-to-state and state-to-state transitions, to solve the problem. Using the radar echo sequences\nfor model training, the authors showed that ConvLSTM is better at capturing the spatiotemporal\ncorrelations than the fully-connected LSTM and gives more accurate predictions than the Real-time\nOptical \ufb02ow by Variational methods for Echoes of Radar (ROVER) algorithm [28] currently used by\nthe Hong Kong Observatory (HKO).\nHowever, despite their pioneering effort in this interesting direction, the paper has some de\ufb01ciencies.\nFirst, the deep learning model is only evaluated on a relatively small dataset containing 97 rainy\ndays and only the nowcasting skill score at the 0.5mm/h rain-rate threshold is compared. As\nreal-world precipitation nowcasting systems need to pay additional attention to heavier rainfall\nevents such as rainstorms which cause more threat to the society, the performance at the 0.5mm/h\nthreshold (indicating raining or not) alone is not suf\ufb01cient for demonstrating the algorithm\u2019s overall\nperformance [28]. 
In fact, as the area Deep Learning for Precipitation Nowcasting is still in its early\nstage, it is not clear how models should be evaluated to meet the need of real-world applications.\nSecond, although the convolutional recurrence structure used in ConvLSTM is better than the fully-\nconnected recurrent structure in capturing spatiotemporal correlations, it is not optimal and leaves\nroom for improvement. For motion patterns like rotation and scaling, the local correlation structure\nof consecutive frames will be different for different spatial locations and timestamps. It is thus\ninef\ufb01cient to use convolution which uses a location-invariant \ufb01lter to represent such location-variant\nrelationship. Previous attempts have tried to solve the problem by revising the output of a recurrent\nneural network (RNN) from the raw prediction to be some location-variant transformation of the\ninput, like optical \ufb02ow or dynamic local \ufb01lter [5, 3]. However, not much research has been conducted\nto address the problem by revising the recurrent structure itself.\nIn this paper, we aim to address these two problems by proposing both a benchmark and a new\nmodel for precipitation nowcasting. For the new benchmark, we build the HKO-7 dataset which\ncontains radar echo data from 2009 to 2015 near Hong Kong. Since the radar echo maps arrive in\na stream in the real-world scenario, the nowcasting algorithms can adopt online learning to adapt\nto the newly emerging patterns dynamically. To take into account this setting, we use two testing\nprotocols in our benchmark: the of\ufb02ine setting in which the algorithm can only use a \ufb01xed window\nof the previous radar echo maps and the online setting in which the algorithm is free to use all the\nhistorical data and any online learning algorithm. 
Another issue for the precipitation nowcasting\ntask is that the proportions of rainfall events at different rain-rate thresholds are highly imbalanced.\nHeavier rainfall occurs less often but has a higher real-world impact. We thus propose the Balanced\nMean Squared Error (B-MSE) and Balanced Mean Absolute Error (B-MAE) measures for training\nand evaluation, which assign more weights to heavier rainfalls in the calculation of MSE and MAE.\nWe empirically \ufb01nd that the balanced variants of the loss functions are more consistent with the\noverall nowcasting performance at multiple rain-rate thresholds than the original loss functions.\nMoreover, our experiments show that training with the balanced loss functions is essential for deep\nlearning models to achieve good performance at higher rain-rate thresholds. For the new model, we\npropose the Trajectory Gated Recurrent Unit (TrajGRU) model which uses a subnetwork to output the\nstate-to-state connection structures before state transitions. TrajGRU allows the state to be aggregated\nalong some learned trajectories and thus is more \ufb02exible than the Convolutional GRU (ConvGRU) [2]\nwhose connection structure is \ufb01xed. We show that TrajGRU outperforms ConvGRU, Dynamic Filter\nNetwork (DFN) [3] as well as 2D and 3D Convolutional Neural Networks (CNNs) [20, 27] in both a\nsynthetic MovingMNIST++ dataset and the HKO-7 dataset.\nUsing the new dataset, testing protocols, training loss and model, we provide extensive empirical\nevaluation of seven models, including a simple baseline model which always predicts the last frame,\ntwo optical \ufb02ow based models (ROVER and its nonlinear variant), and four representative deep\nlearning models (TrajGRU, ConvGRU, 2D CNN, and 3D CNN). We also provide a large-scale\n\n2\n\n\fbenchmark for precipitation nowcasting. 
Our experimental validation shows that (1) all the deep\nlearning models outperform the optical \ufb02ow based models, (2) TrajGRU attains the best overall\nperformance among all the deep learning models, and (3) after applying online \ufb01ne-tuning, the models\ntested in the online setting consistently outperform those in the of\ufb02ine setting. To the best of our\nknowledge, this is the \ufb01rst comprehensive benchmark of deep learning models for the precipitation\nnowcasting problem. Besides, since precipitation nowcasting can be viewed as a video prediction\nproblem [22, 27], our work is the \ufb01rst to provide evidence and justi\ufb01cation that online learning could\npotentially be helpful for video prediction in general.\n\n2 Related Work\n\nDeep learning for precipitation nowcasting and video prediction For the precipitation nowcast-\ning problem, the re\ufb02ectivity factors in radar echo maps are \ufb01rst transformed to grayscale images\nbefore being fed into the prediction algorithm [23]. Thus, precipitation nowcasting can be viewed\nas a type of video prediction problem with a \ufb01xed \u201ccamera\u201d, which is the weather radar. Therefore,\nmethods proposed for predicting future frames in natural videos are also applicable to precipitation\nnowcasting and are related to our paper. There are three types of general architecture for video\nprediction: RNN based models, 2D CNN based models, and 3D CNN based models. Ranzato et\nal. [22] proposed the \ufb01rst RNN based model for video prediction, which uses a convolutional RNN\nwith 1 \u00d7 1 state-state kernel to encode the observed frames. Srivastava et al. [24] proposed the LSTM\nencoder-decoder network which uses one LSTM to encode the input frames and another LSTM to\npredict multiple frames ahead. The model was generalized in [23] by replacing the fully-connected\nLSTM with ConvLSTM to capture the spatiotemporal correlations better. Later, Finn et al. [5] and De\nBrabandere et al. 
[3] extended the model in [23] by making the network predict the transformation of\nthe input frame instead of directly predicting the raw pixels. Ruben et al. [26] proposed to use both an\nRNN that captures the motion and a CNN that captures the content to generate the prediction. Along\nwith RNN based models, 2D and 3D CNN based models were proposed in [20] and [27] respectively.\nMathieu et al. [20] treated the frame sequence as multiple channels and applied 2D CNN to generate\nthe prediction while [27] treated them as the depth and applied 3D CNN. Both papers show that\nGenerative Adversarial Network (GAN) [6] is helpful for generating sharp predictions.\n\nStructured recurrent connection for spatiotemporal modeling From a higher-level perspective,\nprecipitation nowcasting and video prediction are intrinsically spatiotemporal sequence forecasting\nproblems in which both the input and output are spatiotemporal sequences [23]. Recently, there is\na trend of replacing the fully-connected structure in the recurrent connections of RNN with other\ntopologies to enhance the network\u2019s ability to model the spatiotemporal relationship. Other than the\nConvLSTM which replaces the full-connection with convolution and is designed for dense videos, the\nSocialLSTM [1] and the Structural-RNN (S-RNN) [11] have been proposed sharing a similar notion.\nSocialLSTM de\ufb01nes the topology based on the distance between different people and is designed for\nhuman trajectory prediction while S-RNN de\ufb01nes the structure based on the given spatiotemporal\ngraph. All these models are different from our TrajGRU in that our model actively learns the recurrent\nconnection structure. Liang et al. [17] have proposed the Structure-evolving LSTM, which also has the\nability to learn the connection structure of RNNs. However, their model is designed for the semantic\nobject parsing task and learns how to merge the graph nodes automatically. 
It is thus different from TrajGRU, which aims at learning the local correlation structure for spatiotemporal data.

Benchmark for video tasks  There exist benchmarks for several video tasks like online object tracking [29] and video object segmentation [21]. However, there is no benchmark for the precipitation nowcasting problem, which is also a video task but has its unique properties, since the radar echo map is a completely different type of data and the data are highly imbalanced (as mentioned in Section 1). The large-scale benchmark created as part of this work could help fill the gap.

3 Model

In this section, we present our new model for precipitation nowcasting. We first introduce the general encoding-forecasting structure used in this paper. Then we review the ConvGRU model and present our new TrajGRU model.

3.1 Encoding-forecasting Structure

We adopt a similar formulation of the precipitation nowcasting problem as in [23]. Assume that the radar echo maps form a spatiotemporal sequence ⟨I_1, I_2, ...⟩. At a given timestamp t, our model generates the most likely K-step predictions, Î_{t+1}, Î_{t+2}, ..., Î_{t+K}, based on the previous J observations including the current one: I_{t−J+1}, I_{t−J+2}, ..., I_t. Our encoding-forecasting network first encodes the observations into n layers of RNN states:

    H^1_t, H^2_t, ..., H^n_t = h(I_{t−J+1}, I_{t−J+2}, ..., I_t),

and then uses another n layers of RNNs to generate the predictions based on these encoded states:

    Î_{t+1}, Î_{t+2}, ..., Î_{t+K} = g(H^1_t, H^2_t, ..., H^n_t).

Figure 1 illustrates our encoding-forecasting structure for n = 3, J = 2, K = 2. We insert downsampling and upsampling layers between the RNNs, which are implemented by convolution and deconvolution with stride.
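The encoder-forecaster pipeline can be sketched at the shape level. The snippet below is a minimal NumPy illustration, not the paper's implementation: the RNN update, the strided convolution, and the strided deconvolution are replaced by toy stand-ins (a tanh cell, stride-2 subsampling, and nearest-neighbour upsampling), and the channel size is kept constant across layers for simplicity.

```python
import numpy as np

def downsample(x):
    # stand-in for the strided convolution between RNN layers
    return x[:, ::2, ::2]

def upsample(x):
    # stand-in for the strided deconvolution between RNN layers
    return x.repeat(2, axis=1).repeat(2, axis=2)

def rnn_step(x, h):
    # stand-in for a ConvGRU/TrajGRU update (state shape = input shape here)
    return np.tanh(x + h)

def encode(frames, n_layers=3):
    # h: run the J observations through n stacked RNN layers,
    # downsampling between layers
    states = [None] * n_layers
    for frame in frames:
        x = frame
        for k in range(n_layers):
            if k > 0:
                x = downsample(x)
            h = states[k] if states[k] is not None else np.zeros_like(x)
            states[k] = rnn_step(x, h)
            x = states[k]
    return states

def forecast(states, K):
    # g: generate K predictions, running the layers in reverse so that
    # high-level states guide the low-level ones; the missing input link
    # of the top layer is replaced by zeros
    preds = []
    for _ in range(K):
        x = None
        for k in reversed(range(len(states))):
            inp = np.zeros_like(states[k]) if x is None else x
            states[k] = rnn_step(inp, states[k])
            x = upsample(states[k]) if k > 0 else states[k]
        preds.append(x)
    return preds
```

With three layers and stride-2 resampling, the states live at full, half, and quarter resolution, and each prediction comes back at full resolution.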
The reason to reverse the order of the forecasting network is that the high-level states, which have captured the global spatiotemporal representation, can guide the update of the low-level states. Moreover, the low-level states can further influence the prediction. This structure is more reasonable than the previous structure [23], which does not reverse the links of the forecasting network, because we are free to plug in additional RNN layers on top and no skip-connection is required to aggregate the low-level information. One can choose any type of RNN, like ConvGRU or our newly proposed TrajGRU, in this general encoding-forecasting structure as long as its states correspond to tensors.

3.2 Convolutional GRU

The main formulas of the ConvGRU used in this paper are given as follows:

    Z_t = σ(W_xz * X_t + W_hz * H_{t−1}),
    R_t = σ(W_xr * X_t + W_hr * H_{t−1}),
    H'_t = f(W_xh * X_t + R_t ∘ (W_hh * H_{t−1})),
    H_t = (1 − Z_t) ∘ H'_t + Z_t ∘ H_{t−1}.    (1)

The bias terms are omitted for notational simplicity. '*' is the convolution operation and '∘' is the Hadamard product. Here, H_t, R_t, Z_t, H'_t ∈ R^{C_h×H×W} are the memory state, reset gate, update gate, and new information, respectively. X_t ∈ R^{C_i×H×W} is the input and f is the activation, which is chosen to be leaky ReLU with negative slope 0.2 [18] throughout the paper. H, W are the height and width of the state and input tensors and C_h, C_i are the channel sizes of the state and input tensors, respectively.
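As a concrete reading of Eq. (1), here is a minimal NumPy sketch of a single ConvGRU update. It is an illustration rather than the trained model: the convolution is a naive 'same'-padding loop, biases are omitted as in the text, and the weight tensors are assumed to be given.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same' 2D cross-correlation: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    H, W = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.tensordot(
                w, xp[:, i:i + k, j:j + k], axes=([1, 2, 3], [0, 1, 2]))
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def leaky_relu(a, slope=0.2):
    # the activation f in the paper
    return np.where(a > 0, a, slope * a)

def convgru_step(x, h, Wxz, Whz, Wxr, Whr, Wxh, Whh):
    z = sigmoid(conv2d_same(x, Wxz) + conv2d_same(h, Whz))  # update gate
    r = sigmoid(conv2d_same(x, Wxr) + conv2d_same(h, Whr))  # reset gate
    h_new = leaky_relu(conv2d_same(x, Wxh) + r * conv2d_same(h, Whh))
    return (1 - z) * h_new + z * h
```

With all-zero weights both gates equal 0.5 everywhere, so the output is exactly half of the previous state, which makes a convenient sanity check.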
Every time a new input arrives, the reset gate controls whether to clear the previous state and the update gate controls how much of the new information is written to the state.

3.3 Trajectory GRU

When used for capturing spatiotemporal correlations, the deficiency of ConvGRU and other ConvRNNs is that the connection structure and weights are fixed for all the locations. The convolution operation basically applies a location-invariant filter to the input. If the inputs are all zero and the reset gates are all one, we can rewrite the computation of the new information at a specific location (i, j) at timestamp t, i.e., H'_{t,:,i,j}, as follows:

    H'_{t,:,i,j} = f(W_hh concat(⟨H_{t−1,:,p,q} | (p, q) ∈ N^h_{i,j}⟩)) = f( Σ_{l=1}^{|N^h_{i,j}|} W^l_hh H_{t−1,:,p_{l,i,j},q_{l,i,j}} ).    (2)

Here, N^h_{i,j} is the ordered neighborhood set at location (i, j) defined by the hyperparameters of the state-to-state convolution, such as kernel size, dilation and padding [30]. (p_{l,i,j}, q_{l,i,j}) is the lth neighborhood location of position (i, j). The concat(·) function concatenates the inner vectors in the set and W_hh is the matrix representation of the state-to-state convolution weights.

As the hyperparameters of the convolution are fixed, the neighborhood set N^h_{i,j} stays the same for all locations. However, most motion patterns have different neighborhood sets for different locations. For example, rotation and scaling generate flow fields with different angles pointing in different directions. It would thus be more reasonable to have a location-variant connection structure:

    H'_{t,:,i,j} = f( Σ_{l=1}^{L} W^l_hh H_{t−1,:,p_{l,i,j}(θ),q_{l,i,j}(θ)} ),    (3)

(a) For convolutional RNN, the recurrent connections are fixed over time.

Figure 1: Example of the encoding-forecasting structure used in the paper.
In the figure, we use three RNNs to predict two future frames Î_3, Î_4 given the two input frames I_1, I_2. The spatial coordinates G are concatenated to the input frame to ensure the network knows the observations are from different locations. The RNNs can be either ConvGRU or TrajGRU. Zeros are fed as input to the RNN if the input link is missing.

(b) For trajectory RNN, the recurrent connections are dynamically determined.

Figure 2: Comparison of the connection structures of convolutional RNN and trajectory RNN. Links with the same color share the same transition weights. (Best viewed in color)

where L is the total number of local links and (p_{l,i,j}(θ), q_{l,i,j}(θ)) is the lth neighborhood parameterized by θ.

Based on this observation, we propose the TrajGRU, which uses the current input and previous state to generate the local neighborhood set for each location at each timestamp. Since the location indices are discrete and non-differentiable, we use a set of continuous optical flows to represent these "indices". The main formulas of TrajGRU are given as follows:

    U_t, V_t = γ(X_t, H_{t−1}),
    Z_t = σ(W_xz * X_t + Σ_{l=1}^{L} W^l_hz * warp(H_{t−1}, U_{t,l}, V_{t,l})),
    R_t = σ(W_xr * X_t + Σ_{l=1}^{L} W^l_hr * warp(H_{t−1}, U_{t,l}, V_{t,l})),
    H'_t = f(W_xh * X_t + R_t ∘ (Σ_{l=1}^{L} W^l_hh * warp(H_{t−1}, U_{t,l}, V_{t,l}))),
    H_t = (1 − Z_t) ∘ H'_t + Z_t ∘ H_{t−1}.    (4)

Here, L is the total number of allowed links. U_t, V_t ∈ R^{L×H×W} are the flow fields that store the local connection structure generated by the structure generating network γ. The W^l_hz, W^l_hr, W^l_hh are the weights for projecting the channels, which are implemented by 1 × 1 convolutions.
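The warp(·) operation used in the TrajGRU formulas samples the previous state along the learned flow fields via a bilinear kernel. The quadruple loop below is a deliberately literal (and slow) NumPy transcription, meant only to make the indexing explicit; a practical implementation would gather the four neighboring pixels directly.

```python
import numpy as np

def warp(I, U, V):
    """Bilinear warp: I is (C, H, W); U, V are (H, W) horizontal/vertical flows."""
    C, H, W = I.shape
    M = np.zeros_like(I, dtype=float)
    for i in range(H):
        for j in range(W):
            for m in range(H):
                for n in range(W):
                    # bilinear sampling kernel weights
                    ky = max(0.0, 1.0 - abs(i + V[i, j] - m))
                    kx = max(0.0, 1.0 - abs(j + U[i, j] - n))
                    if ky > 0 and kx > 0:
                        M[:, i, j] += I[:, m, n] * ky * kx
    return M
```

A zero flow field reproduces the input exactly, and an integer flow shifts the image, with zeros past the border.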
The warp(H_{t−1}, U_{t,l}, V_{t,l}) function selects the positions pointed out by U_{t,l}, V_{t,l} from H_{t−1} via the bilinear sampling kernel [10, 9]. If we denote M = warp(I, U, V), where M, I ∈ R^{C×H×W} and U, V ∈ R^{H×W}, we have:

    M_{c,i,j} = Σ_{m=1}^{H} Σ_{n=1}^{W} I_{c,m,n} max(0, 1 − |i + V_{i,j} − m|) max(0, 1 − |j + U_{i,j} − n|).    (5)

The advantage of such a structure is that we can learn the connection topology by learning the parameters of the subnetwork γ. In our experiments, γ takes the concatenation of X_t and H_{t−1} as the input and is fixed to be a one-hidden-layer convolutional neural network with 5 × 5 kernel size and 32 feature maps. Thus, γ has only a small number of parameters and adds nearly no cost to the overall computation. Compared to a ConvGRU with K × K state-to-state convolution, TrajGRU is able to learn a more efficient connection structure with L < K². For ConvGRU and TrajGRU, the number of model parameters is dominated by the size of the state-to-state weights, which is O(L × C_h²) for TrajGRU and O(K² × C_h²) for ConvGRU. If L is chosen to be smaller than K², the

Table 1: Comparison of TrajGRU and the baseline models on the MovingMNIST++ dataset. 'Conv-Kα-Dβ' refers to the ConvGRU with kernel size α × α and dilation β × β.
'Traj-Lλ' refers to the TrajGRU with λ links. We replace the output layer of the ConvGRU-K5-D1 model to get the DFN.

    Model        #Parameters  Test MSE ×10⁻²  Standard Deviation ×10⁻²
    Conv-K3-D2   2.84M        1.495           0.003
    Conv-K5-D1   4.77M        1.310           0.004
    Conv-K7-D1   8.01M        1.254           0.006
    Traj-L5      2.60M        1.351           0.015
    Traj-L9      3.42M        1.247           0.020
    Traj-L13     4.00M        1.170           0.022
    TrajGRU-L17  4.77M        1.138           0.019
    DFN          4.83M        1.461           0.002
    Conv2D       29.06M       1.637           0.002
    Conv3D       32.52M       1.681           0.001

number of parameters of TrajGRU can also be smaller than that of ConvGRU, and the TrajGRU model is able to use the parameters more efficiently. Illustration of the recurrent connection structures of ConvGRU and TrajGRU is given in Figure 2. Recently, Jeon & Kim [12] have used similar ideas to extend the convolution operations in CNNs. However, their proposed Active Convolution Unit (ACU) focuses on images, where the need for location-variant filters is limited. Our TrajGRU focuses on videos, where location-variant filters are crucial for handling motion patterns like rotations. Moreover, we are revising the structure of the recurrent connection and have tested different numbers of links, while [12] fixes the link number to 9.

4 Experiments on MovingMNIST++

Before evaluating our model on the more challenging precipitation nowcasting task, we first compare TrajGRU with ConvGRU, DFN and 2D/3D CNNs on a synthetic video prediction dataset to justify its effectiveness.

The previous MovingMNIST dataset [24, 23] only moves the digits with a constant speed and is not suitable for evaluating different models' ability to capture more complicated motion patterns. We thus design the MovingMNIST++ dataset by extending MovingMNIST to allow random rotations, scale changes, and illumination changes. Each frame is of size 64 × 64 and contains three moving digits.
We use 10 frames as input to predict the next 10 frames. As the frames have illumination\nchanges, we use MSE instead of cross-entropy for training and evaluation 2. We train all models\nusing the Adam optimizer [14] with learning rate equal to 10\u22124 and momentum equal to 0.5. For\nthe RNN models, we use the encoding-forecasting structure introduced previously with three RNN\nlayers. All RNNs are either ConvGRU or TrajGRU and all use the same set of hyperparameters. For\nTrajGRU, we initialize the weight of the output layer of the structure generating network to zero.\nThe strides of the middle downsampling and upsampling layers are chosen to be 2. The numbers\nof \ufb01lters for the three RNNs are 64, 96, 96 respectively. For the DFN model, we replace the output\nlayer of ConvGRU with a 11 \u00d7 11 local \ufb01lter and transform the previous frame to get the prediction.\nFor the RNN models, we train them for 200,000 iterations with norm clipping threshold equal to\n1 and batch size equal to 4. For the CNN models, we train them for 100,000 iterations with norm\nclipping threshold equal to 50 and batch size equal to 32. The detailed experimental con\ufb01guration of\nthe models for the MovingMNIST++ experiment can be found in the appendix. We have also tried to\nuse conditional GAN for the 2D and 3D models but have failed to get reasonable results.\nTable 1 gives the results of different models on the same test set that contains 10,000 sequences. We\ntrain all models using three different seeds to report the standard deviation. We can \ufb01nd that TrajGRU\nwith only 5 links outperforms ConvGRU with state-to-state kernel size 3 \u00d7 3 and dilation 2 \u00d7 2 (9\nlinks). Also, the performance of TrajGRU improves as the number of links increases. TrajGRU\nwith L = 13 outperforms ConvGRU with 7 \u00d7 7 state-to-state kernel and yet has fewer parameters.\nAnother observation from the table is that DFN does not perform well in this synthetic dataset. 
This is because DFN uses softmax to enhance the sparsity of the learned local filters, which fails to model illumination change: the maximum value always gets smaller after convolving with a positive kernel whose weights sum to 1. For DFN, once the pixel values get smaller, it is impossible for them to increase again. Figure 3 visualizes the learned structures of TrajGRU. We can see that the network has learned reasonable local link patterns.

2The MSE for the MovingMNIST++ experiment is averaged over both the frame size and the length of the predicted sequence.

Figure 3: Selected links of TrajGRU-L13 at different frames and layers. We choose one of the 13 links and plot an arrow starting from each pixel to the pixel that is referenced by the link. From left to right we display the learned structure at the first, second and third layer of the encoder. The links displayed here have learned behaviour for rotations. We sub-sample the displayed links for the first layer for better readability. We include animations for all layers and links in the supplementary material. (Best viewed when zoomed in.)

5 Benchmark for Precipitation Nowcasting

5.1 HKO-7 Dataset

The HKO-7 dataset used in the benchmark contains radar echo data from 2009 to 2015 collected by HKO. The radar CAPPI reflectivity images, which have a resolution of 480 × 480 pixels, are taken from an altitude of 2 km and cover a 512 km × 512 km area centered in Hong Kong. The data are recorded every 6 minutes and hence there are 240 frames per day. The raw logarithmic radar reflectivity factors are linearly transformed to pixel values via pixel = ⌊255 × (dBZ + 10) / 70 + 0.5⌋ and are clipped to be between 0 and 255.
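The pixel transformation can be sketched directly. The forward mapping below follows the formula in the text; the approximate inverse is our own addition for illustration and is not specified in the paper.

```python
import numpy as np

def dbz_to_pixel(dbz):
    # pixel = floor(255 * (dBZ + 10) / 70 + 0.5), clipped to [0, 255]
    scaled = 255.0 * (np.asarray(dbz, dtype=float) + 10.0) / 70.0 + 0.5
    return np.clip(np.floor(scaled), 0, 255).astype(np.uint8)

def pixel_to_dbz(pixel):
    # approximate inverse of the transform above (our illustration,
    # not taken from the paper)
    return np.asarray(pixel, dtype=float) / 255.0 * 70.0 - 10.0
```

The encoding maps the displayable reflectivity range of roughly −10 dBZ to 60 dBZ onto the full 8-bit pixel range; round-tripping a value loses at most the quantization error.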
The raw radar echo images generated by Doppler weather radar are noisy due to\nfactors like ground clutter, sea clutter, anomalous propagation and electromagnetic interference [16].\nTo alleviate the impact of noise in training and evaluation, we \ufb01lter the noisy pixels in the dataset and\ngenerate the noise masks by a two-stage process described in the appendix.\nAs rainfall events occur sparsely, we select the rainy days based on the rain barrel information to form\nour \ufb01nal dataset, which has 812 days for training, 50 days for validation and 131 days for testing.\nOur current treatment is close to the real-life scenario as we are able to train an additional model that\nclassi\ufb01es whether or not it will rain on the next day and applies our precipitation nowcasting model if\nthis coarser-level model predicts that it will be rainy. The radar re\ufb02ectivity values are converted to\nrainfall intensity values (mm/h) using the Z-R relationship: dBZ = 10 log a + 10b log R where R is\nthe rain-rate level, a = 58.53 and b = 1.56. The overall statistics and the average monthly rainfall\ndistribution of the HKO-7 dataset are given in the appendix.\n\n5.2 Evaluation Methodology\n\nAs the radar echo maps arrive in a stream, nowcasting algorithms can apply online learning to adapt\nto the newly emerging spatiotemporal patterns. We propose two settings in our evaluation protocol:\n(1) the of\ufb02ine setting in which the algorithm always receives 5 frames as input and predicts 20 frames\nahead, and (2) the online setting in which the algorithm receives segments of length 5 sequentially and\npredicts 20 frames ahead for each new segment received. The evaluation protocol is described more\nsystematically in the appendix. 
The testing environment guarantees that the same set of sequences is tested in both the offline and online settings for fair comparison.

For both settings, we evaluate the skill scores for multiple thresholds that correspond to different rainfall levels to give an all-round evaluation of the algorithms' nowcasting performance. Table 2 shows the distribution of different rainfall levels in our dataset. We choose to use the thresholds 0.5, 2, 5, 10, 30 to calculate the CSI and Heidke Skill Score (HSS) [8]. For calculating the skill score at a specific threshold τ, which is 0.5, 2, 5, 10 or 30, we first convert the pixel values in prediction and ground-truth to 0/1 by thresholding with τ. We then calculate the TP (prediction=1, truth=1), FN (prediction=0, truth=1), FP (prediction=1, truth=0), and TN (prediction=0, truth=0). The CSI score is calculated as

    CSI = TP / (TP + FN + FP)

and the HSS score is calculated as

    HSS = (TP × TN − FN × FP) / ((TP + FN)(FN + TN) + (TP + FP)(FP + TN)).

During the computation, the masked points are ignored.

Table 2: Rain rate statistics in the HKO-7 benchmark.

    Rain Rate (mm/h)   Proportion (%)   Rainfall Level
    0 ≤ x < 0.5        90.25            No / Hardly noticeable
    0.5 ≤ x < 2        4.38             Light
    2 ≤ x < 5          2.46             Light to moderate
    5 ≤ x < 10         1.35             Moderate
    10 ≤ x < 30        1.14             Moderate to heavy
    30 ≤ x             0.42             Rainstorm warning

As shown in Table 2, the frequencies of different rainfall levels are highly imbalanced. We propose to use a weighted loss function to help solve this problem. Specifically, we assign a weight w(x) to each pixel according to its rainfall intensity x:

    w(x) = 1,   x < 2
           2,   2 ≤ x < 5
           5,   5 ≤ x < 10
           10,  10 ≤ x < 30
           30,  x ≥ 30
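A minimal sketch of the skill-score computation described above, assuming that "prediction=1" means the rain rate reaches the threshold τ and that an optional boolean mask marks the valid (non-noisy) pixels:

```python
import numpy as np

def skill_scores(pred, truth, threshold, mask=None):
    # binarize at the rain-rate threshold; masked (noisy) points are ignored
    p = pred >= threshold
    t = truth >= threshold
    valid = np.ones_like(p, dtype=bool) if mask is None else mask
    TP = np.sum(p & t & valid)
    FN = np.sum(~p & t & valid)
    FP = np.sum(p & ~t & valid)
    TN = np.sum(~p & ~t & valid)
    csi = TP / (TP + FN + FP)
    hss = (TP * TN - FN * FP) / ((TP + FN) * (FN + TN) + (TP + FP) * (FP + TN))
    return csi, hss
```

In a benchmark run these scores would be computed per threshold and averaged over the predicted frames.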
Also, the masked pixels have weight 0. The resulting B-MSE and B-MAE scores are computed as

B-MSE = (1/N) Σ_{n=1}^{N} Σ_{i=1}^{480} Σ_{j=1}^{480} w_{n,i,j} (x_{n,i,j} − x̂_{n,i,j})²,
B-MAE = (1/N) Σ_{n=1}^{N} Σ_{i=1}^{480} Σ_{j=1}^{480} w_{n,i,j} |x_{n,i,j} − x̂_{n,i,j}|,

where N is the total number of frames and w_{n,i,j} is the weight corresponding to the (i, j)th pixel in the nth frame. For the conventional MSE and MAE measures, we simply set all the weights to 1 except at the masked points.

5.3 Evaluated Algorithms

We have evaluated seven nowcasting algorithms: the simplest model, which always predicts the last frame; two optical flow based methods (ROVER and its nonlinear variant); and four deep learning methods (TrajGRU, ConvGRU, 2D CNN, and 3D CNN). We optimize the sum of B-MSE and B-MAE during both offline training and online fine-tuning. During offline training, all models are optimized by the Adam optimizer [14] with learning rate 10⁻⁴ and momentum 0.5, and we train these models with early stopping on the sum of B-MSE and B-MAE. In the online setting, we evaluate the deep learning models by fine-tuning them with AdaGrad [4] with learning rate 10⁻⁴. For the RNN models, the training batch size is set to 4; for the CNN models, it is set to 8. For the TrajGRU and ConvGRU models, we use a 3-layer encoding-forecasting structure with the numbers of filters for the RNNs set to 64, 192, 192. The kernel sizes are 5 × 5, 5 × 5, 3 × 3 for the ConvGRU models, while the numbers of links are 13, 13, 9 for the TrajGRU model. We also train the ConvGRU model with the original MSE and MAE loss, named "ConvGRU-nobal", to evaluate the improvement brought by training with the B-MSE and B-MAE loss.
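The weighted loss and the thresholded skill scores defined in this section can be sketched with NumPy as follows. This is a minimal illustration under the definitions above, not the benchmark implementation; the function names and the boolean `valid` mask argument (True where a pixel is not masked) are assumptions:

```python
import numpy as np

def pixel_weight(x):
    """w(x) from the piecewise definition in the text."""
    w = np.ones_like(x, dtype=np.float64)
    w[x >= 2] = 2
    w[x >= 5] = 5
    w[x >= 10] = 10
    w[x >= 30] = 30
    return w

def balanced_losses(truth, pred, valid):
    """B-MSE and B-MAE over (frame, i, j) arrays; masked pixels get weight 0."""
    w = pixel_weight(truth) * valid        # zero out masked pixels
    n_frames = truth.shape[0]
    b_mse = np.sum(w * (truth - pred) ** 2) / n_frames
    b_mae = np.sum(w * np.abs(truth - pred)) / n_frames
    return b_mse, b_mae

def skill_scores(truth, pred, valid, tau):
    """CSI and HSS at threshold tau (mm/h), ignoring masked pixels."""
    t, p = truth >= tau, pred >= tau
    tp = np.sum(t & p & valid)
    fn = np.sum(t & ~p & valid)
    fp = np.sum(~t & p & valid)
    tn = np.sum(~t & ~p & valid)
    csi = tp / (tp + fn + fp)
    hss = (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return csi, hss
```

Note how a single boolean mask both removes masked pixels from the contingency counts and zeroes their weights in the balanced losses, matching the treatment described above.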
The other model configurations, including those of ROVER, ROVER-nonlinear and the deep models, are given in the appendix.

5.4 Evaluation Results

The overall evaluation results are summarized in Table 3. To analyze the confidence intervals of the results, we train the 2D CNN, 3D CNN, ConvGRU and TrajGRU models with three different random seeds and report the standard deviations in Table 4. We find that training with a balanced loss function is essential for good nowcasting performance on heavier rainfall. The ConvGRU model trained without the balanced loss, which best represents the model in [23], has worse nowcasting scores than the optical flow based methods at the 10 mm/h and 30 mm/h thresholds. Also, all the deep learning models trained with the balanced loss outperform the optical flow based models. Among the deep learning models, TrajGRU performs best and 3D CNN outperforms 2D CNN, which shows that an appropriate network structure is crucial to achieving good performance. The improvement of TrajGRU over the other models is statistically significant because the differences in B-MSE and B-MAE are larger than three times their standard deviations. Moreover, the performance with online fine-tuning enabled is consistently better than without it, which verifies the effectiveness of online learning, at least for this task.

Table 3: HKO-7 benchmark result. We mark the best result within a specific setting in bold face and the second best by underlining. Each cell contains the mean score over the 20 predicted frames. In the online setting, all algorithms use the online learning strategy described in the paper. '↑' means higher is better while '↓' means lower is better. 'r ≥ τ' means the skill score at the τ mm/h rainfall threshold.
For the 2D CNN, 3D CNN, ConvGRU and TrajGRU models, we train with three different random seeds and report the mean scores.

                                          CSI ↑                                           HSS ↑                      B-MSE ↓  B-MAE ↓
Algorithms           r≥0.5   r≥2     r≥5     r≥10    r≥30    r≥0.5   r≥2     r≥5     r≥10    r≥30

Offline Setting
Last Frame           0.4022  0.3266  0.2401  0.1574  0.0692  0.5207  0.4531  0.3582  0.2512  0.1193   15274    28042
ROVER + Linear       0.4762  0.4089  0.3151  0.2146  0.1067  0.6038  0.5473  0.4516  0.3301  0.1762   11651    23437
ROVER + Non-linear   0.4655  0.4074  0.3226  0.2164  0.0951  0.5896  0.5436  0.4590  0.3318  0.1576   10945    22857
2D CNN               0.5095  0.4396  0.3406  0.2392  0.1093  0.6366  0.5809  0.4851  0.3690  0.1885    7332    18091
3D CNN               0.5109  0.4411  0.3415  0.2424  0.1185  0.6334  0.5825  0.4862  0.3734  0.2034    7202    17593
ConvGRU-nobal        0.5476  0.4661  0.3526  0.2138  0.0712  0.6756  0.6094  0.4981  0.3286  0.1160    9087    19642
ConvGRU              0.5489  0.4731  0.3720  0.2789  0.1776  0.6701  0.6104  0.5163  0.4159  0.2893    5951    15000
TrajGRU              0.5528  0.4759  0.3751  0.2835  0.1856  0.6731  0.6126  0.5192  0.4207  0.2996    5816    14675

Online Setting
2D CNN               0.5112  0.4363  0.3364  0.2435  0.1263  0.6365  0.5756  0.4790  0.3744  0.2162    6654    17071
3D CNN               0.5106  0.4344  0.3345  0.2427  0.1299  0.6355  0.5736  0.4766  0.3733  0.2220    6690    16903
ConvGRU              0.5511  0.4737  0.3742  0.2843  0.1837  0.6712  0.6105  0.5183  0.4226  0.2981    5724    14772
TrajGRU              0.5563  0.4798  0.3808  0.2914  0.1933  0.6760  0.6164  0.5253  0.4308  0.3111    5589    14465

Table 4: Confidence intervals of selected deep models in the HKO-7 benchmark.
We train 2D CNN, 3D CNN, ConvGRU and TrajGRU using three different random seeds and report the standard deviations of the test scores.

                                          CSI                                             HSS                        B-MSE  B-MAE
Algorithms           r≥0.5   r≥2     r≥5     r≥10    r≥30    r≥0.5   r≥2     r≥5     r≥10    r≥30

Offline Setting
2D CNN               0.0032  0.0023  0.0015  0.0001  0.0025  0.0032  0.0025  0.0018  0.0003  0.0043     90     95
3D CNN               0.0043  0.0027  0.0016  0.0024  0.0024  0.0042  0.0028  0.0018  0.0031  0.0041     44     26
ConvGRU              0.0022  0.0018  0.0031  0.0008  0.0022  0.0022  0.0021  0.0040  0.0010  0.0038     52     81
TrajGRU              0.0020  0.0024  0.0025  0.0031  0.0031  0.0019  0.0024  0.0028  0.0039  0.0045     18     32

Online Setting
2D CNN               0.0002  0.0005  0.0002  0.0002  0.0012  0.0002  0.0005  0.0002  0.0003  0.0019     12     12
3D CNN               0.0004  0.0003  0.0002  0.0003  0.0008  0.0004  0.0004  0.0003  0.0004  0.0001     23     27
ConvGRU              0.0006  0.0012  0.0017  0.0019  0.0024  0.0006  0.0012  0.0019  0.0023  0.0031     30     69
TrajGRU              0.0008  0.0004  0.0002  0.0002  0.0002  0.0007  0.0004  0.0002  0.0002  0.0003     10     20

Table 5: Kendall's τ coefficients between skill scores. A higher absolute value indicates a stronger correlation.
The numbers with the largest absolute values are shown in bold face.

                               CSI                                  HSS
Skill Scores   r≥0.5   r≥2     r≥5     r≥10    r≥30    r≥0.5   r≥2     r≥5     r≥10    r≥30

MSE            -0.24   -0.39   -0.39   -0.07   -0.01   -0.33   -0.42   -0.39   -0.06    0.01
MAE            -0.41   -0.57   -0.55   -0.25   -0.27   -0.50   -0.60   -0.55   -0.24   -0.26
B-MSE          -0.70   -0.57   -0.61   -0.86   -0.84   -0.62   -0.55   -0.61   -0.86   -0.84
B-MAE          -0.74   -0.59   -0.58   -0.82   -0.92   -0.67   -0.57   -0.59   -0.83   -0.92

Based on the evaluation results, we also compute Kendall's τ coefficients [13] between the MSE, MAE, B-MSE, B-MAE and the CSI and HSS at different thresholds. As shown in Table 5, B-MSE and B-MAE have stronger correlations with the CSI and HSS in most cases.

6 Conclusion and Future Work

In this paper, we have provided the first large-scale benchmark for precipitation nowcasting and proposed a new TrajGRU model with the ability to learn the recurrent connection structure. We have shown that TrajGRU is more efficient at capturing spatiotemporal correlations than ConvGRU. For future work, we plan to test whether TrajGRU helps improve other spatiotemporal learning tasks like visual object tracking and video segmentation. We will also try to build an operational nowcasting system using the proposed algorithm.

Acknowledgments

This research has been supported by General Research Fund 16207316 from the Research Grants Council and Innovation and Technology Fund ITS/205/15FP from the Innovation and Technology Commission in Hong Kong. The first author has also been supported by the Hong Kong PhD Fellowship.

References

[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces.
In CVPR, 2016.
[2] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016.
[3] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.
[4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[5] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[8] Robin J Hogan, Christopher AT Ferro, Ian T Jolliffe, and David B Stephenson. Equitability revisited: Why the "equitable threat score" is not equitable. Weather and Forecasting, 25(2):710–726, 2010.
[9] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[10] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[11] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In CVPR, 2016.
[12] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.
[13] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.
Deep learning. Nature, 521(7553):436–444, 2015.
[16] Hansoo Lee and Sungshin Kim. Ensemble classification for anomalous propagation echo detection with clustering-based subset-selection method. Atmosphere, 8(1):11, 2017.
[17] Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P Xing. Interpretable structure-evolving LSTM. In CVPR, 2017.
[18] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
[19] John S Marshall and W McK Palmer. The distribution of raindrops with size. Journal of Meteorology, 5(4):165–166, 1948.
[20] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[21] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
[22] Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: A baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[23] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
[24] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[25] Juanzhen Sun, Ming Xue, James W Wilson, Isztar Zawadzki, Sue P Ballard, Jeanette Onvlee-Hooimeyer, Paul Joe, Dale M Barker, Ping-Wah Li, Brian Golding, et al. Use of NWP for nowcasting convective precipitation: Recent progress and challenges.
Bulletin of the American Meteorological Society, 95(3):409–426, 2014.
[26] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
[27] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[28] Wang-chun Woo and Wai-kin Wong. Operational application of optical flow techniques to radar-based rainfall nowcasting. Atmosphere, 8(3):48, 2017.
[29] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In CVPR, 2013.
[30] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.