{"title": "Seeing the Wind: Visual Wind Speed Prediction with a Coupled Convolutional and Recurrent Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 8735, "page_last": 8745, "abstract": "Wind energy resource quantification, air pollution monitoring, and weather forecasting all rely on rapid, accurate measurement of local wind conditions. Visual observations of the effects of wind---the swaying of trees and flapping of flags, for example---encode information regarding local wind conditions that can potentially be leveraged for visual anemometry that is inexpensive and ubiquitous. Here, we demonstrate a coupled convolutional neural network and recurrent neural network architecture that extracts the wind speed encoded in visually recorded flow-structure interactions of a flag and tree in naturally occurring wind. Predictions for wind speeds ranging from 0.75-11 m/s showed agreement with measurements from a cup anemometer on site, with a root-mean-squared error approaching the natural wind speed variability due to atmospheric turbulence. Generalizability of the network was demonstrated by successful prediction of wind speed based on recordings of other flags in the field and in a controlled wind tunnel test. Furthermore, physics-based scaling of the flapping dynamics accurately predicts the dependence of the network performance on the video frame rate and duration.", "full_text": "Seeing the Wind:\nVisual Wind Speed Prediction with a Coupled\nConvolutional and Recurrent Neural Network\n\nJennifer L. Cardona\nDepartment of Mechanical Engineering\nStanford University\nStanford, CA 94305\njcard27@stanford.edu\n\nMichael F. Howland\nDepartment of Mechanical Engineering\nStanford University\nStanford, CA 94305\nmhowland@stanford.edu\n\nJohn O. Dabiri\nGraduate Aerospace Laboratories (GALCIT)\nand Mechanical Engineering\nCalifornia Institute of Technology\nPasadena, CA 91125\njodabiri@caltech.edu\n\nAbstract\n\nWind energy resource quantification, air pollution monitoring, and weather forecasting all rely on rapid, accurate measurement of local wind conditions. Visual observations of the effects of wind\u2014the swaying of trees and flapping of flags, for example\u2014encode information regarding local wind conditions that can potentially be leveraged for visual anemometry that is inexpensive and ubiquitous. Here, we demonstrate a coupled convolutional neural network and recurrent neural network architecture that extracts the wind speed encoded in visually recorded flow-structure interactions of a flag and tree in naturally occurring wind. Predictions for wind speeds ranging from 0.75-11 m/s showed agreement with measurements from a cup anemometer on site, with a root-mean-squared error approaching the natural wind speed variability due to atmospheric turbulence. Generalizability of the network was demonstrated by successful prediction of wind speed based on recordings of other flags in the field and in a controlled wind tunnel test. Furthermore, physics-based scaling of the flapping dynamics accurately predicts the dependence of the network performance on the video frame rate and duration.\n\n1 Introduction\n\nThe ability to accurately measure wind speeds is important across several applications including locating optimal sites for wind turbines, estimating pollution dispersion, and storm tracking. Currently, taking these measurements requires installing a physical instrument at the exact location of interest, which can be cost prohibitive and in some cases unfeasible. 
Knowledge of the wind resource in cities\nis of particular interest as urbanization draws a larger portion of the world\u2019s population to such areas,\ndriving the need for more distributed energy generation closer to densely populated regions [20].\nThere is also burgeoning interest in the use of drones for delivery, which would greatly bene\ufb01t from\ninstantaneous knowledge of the local wind conditions to minimize energy consumption and ensure\nsafety [34]. Here we demonstrate a technique that enables wind speed measurements to be made\nfrom a video of a \ufb02apping \ufb02ag or swaying tree. This facilitates visual anemometry using pre-existing\nfeatures in an environment, which would be non-intrusive and cost effective for wind mapping.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThe \ufb02ow-structure interaction between an object and the wind encodes information about the wind\nspeed. Neural networks can potentially be used to decode this information. Here, we leverage\nmachine learning to predict wind speeds based on these interactions. The general approach to the\ncurrent problem uses a convolutional neural network (CNN) as a feature extractor on each frame\nin a sequence, followed by a recurrent neural network (RNN) taking in the features extracted from\nthe time series of frames. The input to our algorithm is a two-second video clip (a sequence of 30\nimages), and the output is a wind speed prediction in meters per second (m/s).\nThis visual anemometry technique has the potential to signi\ufb01cantly decrease the cost and time required\nfor mapping wind resources. Installing an anemometer to monitor a single location typically costs\nO($1, 000), and even then only offers measurements at one location. 
While we have installed \ufb02ags\nand trees at a \ufb01eld site to collect initial training and test data, the application of this method would\noccur using pre-existing structures in the environment of interest. Therefore, the only cost of this\nmethod is in the camera recording device (a standard camera phone provides suf\ufb01cient resolution).\nHence, the cost of this method is dramatically lower per measurement point, and the barrier posed by\nthe time and labor required to install an anemometer at a location is removed.\n\n2 Related Work\n\nThe innovation proposed in this study is to use videos observing \ufb02ow-structure interactions to make\nwind speed predictions without using any classical meteorological measurements as inputs. This\nenables wind speed prediction in a much broader range of physical environments, especially those\nwhere meteorological sensors would be expensive or impractical to install. Classic methods of using\nvisual cues to estimate wind speeds include the Beaufort scale, which provides a rough estimate\nbased on human perception of the surrounding environment, and the Griggs-Putnam Index [36],\nwhich relies on the angle of plant growth to estimate annual average speeds. With recent advances in\nmachine learning, the present work seeks to extend the capabilities of visual wind speed measurement\nto provide automated and quantitative real-time measurements.\nExtracting physical quantities from videos has become increasingly prevalent. Several studies have\nused images or videos to estimate material properties of objects [37, 25], and speci\ufb01cally for cloth\n[6, 11, 39]. Video inputs have also been used to predict dynamics of objects [27], and physical\nproperties of \ufb02uid \ufb02ows [33, 30]. 
Estimating model parameters for physical simulations using similarity comparisons to video data has also shown promise in approximating static and dynamic properties of hanging cloth [5], the masses of colliding objects [38], and most recently, wind velocity and material properties given a flapping flag [29]. The success in this type of parameter estimation speaks to the potential for computer vision to be used in determining physical quantities.\nThere has long been interest in the use of neural networks in predicting future wind speeds based on historical measurements [26]. Much of the work in this area has been focused on wind forecasting using time series of measurements from existing instrumentation or weather forecast data as inputs [4, 23, 7, 13, 3, 9, 24]. Deep learning has recently been employed for meteorological precipitation nowcasting, where the authors used spatiotemporal radar echo data to make short-term rainfall intensity predictions using a convolutional LSTM [31].\nRecent work has shown success in classifying actions and motion in video clips using deep networks. The present work, aimed at wind speed regression from videos, draws inspiration from previous studies on video classification. In the deep learning era, a number of approaches to video classification have arisen using deep networks. Three notable strategies that have taken hold include 3D convolutional networks over short clips [2, 19, 22, 35], two-stream networks aiming to extract context and motion information from separate streams [32], and the combination of 2D convolutional networks with subsequent recurrent layers to treat multiple video frames as a time series [12, 41]. Carreira et al. performed a comparison of these different approaches to video classification [8]. 
This study will employ a strategy using a 2D CNN followed by long short-term memory (LSTM) layers. This approach leverages transfer learning on a CNN to extract features related to the instantaneous state of a flag, and an RNN to analyze the wind-induced flapping motion. Further discussion of this architecture choice can be found in Section 4.\n\nFigure 1: Examples of cropped video frame inputs for (a) the training/validation flag and adjacent test flag, (b) the training/validation tree, and (c) the tunnel test flag.\n\n3 Dataset\n\nThe main dataset used for training and validation consisted of videos taken at a field site in Lancaster, CA over the course of 20 days during August 2018. Only videos between the hours of 7:00 a.m. and 6:00 p.m. were used in order to ensure daylight conditions. Videos captured the motion of a standard checkerboard flag with an aspect ratio of 5:3 and size of 1.5 m \u00d7 0.9 m mounted at 3 m height (Figure 1a), as well as the canopy of a young southern magnolia tree (Magnolia grandiflora) of approximately 5 m height (Figure 1b). Videos were recorded at 15 frames per second. The videos were subsequently separated into two-second clips (30 sequential frames), which were formatted as RGB images cropped to 224 \u00d7 224 pixels. Thus, each sample used as a model input consists of a time series of images of total size 30 \u00d7 224 \u00d7 224 \u00d7 3. Ground truth 1-minute average wind speed labels were provided by an anemometer on site at 10 m height. The 1-minute averaging time was chosen instead of a shorter averaging time because of the highly turbulent and variable flow. The anemometer and flag are spatially separated, so the instantaneous measurements made by the anemometer do not correspond to exact instantaneous speeds experienced by the flag. 
Each image sequence was matched to a wind speed label using the timestamp of the first image in the series.\nThe measured 1-minute averaged wind speeds ranged from 0-15.5 m/s. The natural distribution of wind speeds was not uniform over the time period of data collection, with many more samples in the middle speed ranges than at the tails. Since the desired output of the model was predictions over a broad range of wind speeds, a more uniform distribution was preferable. To achieve this, the ground truth wind speeds were binned in 0.25 m/s increments, and for each bin, excess samples were excluded to retain a more even distribution over the range of wind speeds. Clips were chosen at random for the training/validation split. The resulting training set contained 13,365 clips (10,490 flag clips and 2,875 tree clips). The validation sets for the flag and tree contained 4,091 and 2,875 clips respectively. Wind speed distributions for each dataset are provided in the Supplementary Materials document.\nTo assess the generalizability of the network, two test sets were collected containing videos of new flags in additional locations: a field test set (called the adjacent flag test set), and a wind tunnel test set (called the tunnel test set).\n\nAdjacent Flag Test Set The adjacent flag test set comprised clips of a flag identical to the one used for training/validation. The test flag was located 3 m from the training/validation flag (Figure 1a). Clips were taken at the same time as clips from the validation set to allow for direct comparison between validation results and test results on this new flag. Although the timestamps and wind speed labels are identical between the validation set and the adjacent flag test set, the precise wind conditions and corresponding flag motion differ between the two, as turbulence is chaotic and variable in space.\n\nFigure 2: Schematic of the model architecture. The CNN is a pre-trained ResNet-18 architecture [14]. The LSTM is a many-to-one architecture with 2 layers each containing 1000 hidden units.\n\nTunnel Test Set The tunnel test set consists of clips taken of a smaller checkered flag mounted in a wind tunnel (Figure 1c). The tunnel flag had the same 5:3 aspect ratio as the other two flags discussed, but was 0.37 m \u00d7 0.22 m in size. The wind tunnel was run at three speeds: 4.46 \u00b1 0.45 m/s, 5.64 \u00b1 0.45 m/s, and 6.58 \u00b1 0.45 m/s. At each speed, 600 two-second clips were recorded at 15 frames per second, yielding 1,800 tunnel test samples. Although these videos were recorded on a monochrome camera, they were converted to 3-channel images by repeating the grayscale pixel values for each of the three channels for use in the model.\nThe final datasets used in this work are available at https://purl.stanford.edu/ph326kh0190.\n\n4 Methods\n\n4.1 Feature Extraction with ResNet-18\n\nBefore analyzing a video clip as a time series, each individual 224 \u00d7 224 \u00d7 3 frame was fed through a 2D CNN to extract relevant features. The ResNet-18 architecture was chosen for the CNN because of its proven accuracy on previous tasks and relatively low computational cost [14]. Pre-trained weights for ResNet-18 were used in the current implementation to take advantage of transfer learning, available through the MathWorks Deep Learning Toolbox [18].\nSince the purpose of the CNN here is feature extraction rather than image classification, the last two layers of the ResNet-18 (the fully connected layer and the softmax output layer) were removed so that the resulting output for each frame was a 7 \u00d7 7 \u00d7 512 feature map. 
Since the activation function for the final layer was a rectified linear unit (ReLU(x) = max(0, x)), many of the resulting features were zero. Therefore, to reduce memory constraints, the output features were fed through an additional maximum pooling layer with a filter size of 3 and a stride of 2. This acts to downsample the features and reduces the number of zeros in the dataset, reducing the feature map size to 3 \u00d7 3 \u00d7 512 for each image, which was then flattened to 4,608 \u00d7 1. This resulted in a 4,608 \u00d7 30 output for each two-second (30 frame) clip to be used as an input for the recurrent network.\n\n4.2 LSTM With and Without Mean Subtracted Inputs\n\nAn RNN was selected in order to learn temporal features of the videos. The flapping of flags is broadband, containing a wide range of spectral scales [1]. Typically, flags are located in the turbulent atmospheric boundary layer, where the length scales which govern the flow vary from the order of kilometers to the order of micrometers. As a result, the associated time scales will range from milliseconds to minutes. Therefore, the architecture chosen for this application should adapt to the variable spectral composition of the flow field, which is captured by the motion of the flapping flag.\n\nTable 1: Final hyperparameter choices for LSTM networks.\n\nHyperparameter | Chosen Value\n# LSTM layers | 2\n# hidden units per LSTM layer | 1000\nlearning rate | 0.01\n\nThe long short-term memory (LSTM) network was chosen for this application. 
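The shape bookkeeping in Section 4.1 (7 x 7 x 512 feature map, max pooling with filter 3 and stride 2, flattening to 4,608 features per frame) can be checked with a short sketch. The paper's implementation is in MATLAB; this is an illustrative stdlib-Python calculation, and `conv_out` is our own helper name.

```python
import math

def conv_out(size, kernel, stride):
    # Standard output-size formula for an unpadded pooling/convolution window.
    return math.floor((size - kernel) / stride) + 1

# Truncated ResNet-18 emits a 7x7x512 map per 224x224 frame; the extra
# max-pool (filter 3, stride 2) shrinks the spatial side from 7 to 3.
side = conv_out(7, kernel=3, stride=2)
features_per_frame = side * side * 512      # flattened length per frame
clip_feature_shape = (features_per_frame, 30)  # one two-second, 30-frame clip

print(side, features_per_frame, clip_feature_shape)  # 3 4608 (4608, 30)
```

This reproduces the 4,608 x 30 input that each clip presents to the recurrent network.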
It has been shown that the LSTM has the capability to learn interactions over a range of scales in a series [21], as well as advantages in training over longer time series [16], making it a suitable choice for this application. A generic LSTM cell is computed with the input, forget, output, and gate activations i, f, o, and g:\n\n\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix} \quad (1)\n\nand the cell and hidden states are computed as Equations 2 and 3 respectively [16]:\n\nc_t = f \odot c_{t-1} + i \odot g \quad (2)\n\nh_t = o \odot \tanh(c_t) \quad (3)\n\nThe weight matrix W contains the learnable parameters. The sigmoid function, \sigma, is given by \sigma(x) = e^x/(e^x + 1), and \tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}). The LSTM is more easily trained on long sequences compared to standard RNNs because it is not susceptible to the problem of vanishing gradients, which arises due to successive multiplication by W during backpropagation through a standard RNN [15]. In an LSTM network, the cell state allows for uninterrupted gradient flow between memory cells during backpropagation, as it requires only multiplication by f rather than by W.\nAs discussed in Section 4.1, the inputs to the LSTM network are obtained from the features extracted from the pre-trained ResNet-18 network. Since wind conditions are influenced by the diurnal cycle and other weather conditions [10], it is plausible that a model could use features present in the video clips other than the motion of the objects (e.g. the position of the sun, presence of clouds). 
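The cell update in Equations 1-3 can be sketched in plain Python. This is an illustrative toy (biases omitted, tiny hand-picked weights), not the paper's MATLAB implementation; `lstm_cell` and `matvec` are our own names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Matrix-vector product with W stored as a list of rows.
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def lstm_cell(x_t, h_prev, c_prev, W):
    # Equations 1-3: gates from W @ [h_prev; x_t], then cell/hidden updates.
    H = len(h_prev)
    pre = matvec(W, h_prev + x_t)               # 4H pre-activations
    i = [sigmoid(a) for a in pre[0:H]]          # input gate
    f = [sigmoid(a) for a in pre[H:2 * H]]      # forget gate
    o = [sigmoid(a) for a in pre[2 * H:3 * H]]  # output gate
    g = [math.tanh(a) for a in pre[3 * H:]]     # candidate ("gate" gate)
    c_t = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h_t = [oj * math.tanh(cj) for oj, cj in zip(o, c_t)]
    return h_t, c_t

# Tiny example: hidden size 1, input size 1, all weights 0.5.
W = [[0.5, 0.5]] * 4
h, c = lstm_cell([1.0], [0.0], [0.0], W)  # c ~ 0.288, h ~ 0.174
```

Note how the cell update (Equation 2) mixes the previous state via an elementwise product with f only, which is the gradient path the surrounding text credits for avoiding vanishing gradients.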
To study and avoid such artifacts from over-fitting to background conditions, two experiments were run using the same network architecture and hyperparameters, but trained using different inputs, referred to as LSTM-NM (short for no-mean) and LSTM-raw respectively.\n\nLSTM-NM In this experiment, the temporal mean of each feature over the 30-frame clip was subtracted from the inputs to avoid fitting to background features. These mean-subtracted feature maps served as inputs for the LSTM network.\n\nLSTM-raw Here, a second model was trained using the raw features extracted from the ResNet-18 model without mean subtraction. The main purpose of this experiment is to identify whether removing the temporal mean from a sequence is beneficial for model generalizability to new locations, and to confirm that it is in fact the object motion that is used for predictions.\nThe LSTM architecture used here is many-to-one, since we have a series of 30 images being fed into the LSTM network with only one regression prediction being made. A schematic of the overall architecture is shown in Figure 2. Hyperparameters were chosen based on values used for other spatiotemporal tasks with a similar model architecture in the literature [40, 41, 12]. The final size of the LSTM network was chosen to be 2 layers with 1000 hidden units per layer. Two smaller models were also considered (1 layer with 10 hidden units, and 1 layer with 100 hidden units), but these models suffered from high bias, and were under-fitting the training set. A summary of the chosen hyperparameters is shown in Table 1.\n\n4.3 Implementation Details\n\nThis problem is framed as a regression, with a regression output layer that allows the model to predict any wind speed as opposed to a specific class. 
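The temporal mean subtraction that distinguishes LSTM-NM from LSTM-raw can be sketched as follows. This is a minimal stdlib-Python illustration of the idea described in Section 4.2, with our own function name and toy two-frame, two-feature clip; the actual inputs are 30 frames of 4,608 features each.

```python
def subtract_temporal_mean(clip):
    """clip: list of frames, each a list of per-frame feature values.
    Subtracts each feature's mean over the clip, so features that are
    constant in time (static background content) cancel to zero."""
    n = len(clip)
    n_feat = len(clip[0])
    means = [sum(frame[j] for frame in clip) / n for j in range(n_feat)]
    return [[v - m for v, m in zip(frame, means)] for frame in clip]

# Feature 0 varies over time (motion); feature 1 is static (background).
clip = [[1.0, 10.0], [3.0, 10.0]]
print(subtract_temporal_mean(clip))  # [[-1.0, 0.0], [1.0, 0.0]]
```

The static feature is zeroed out while the time-varying one survives, which is the mechanism the LSTM-NM experiment relies on to suppress background cues.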
The mean-squared error was used as the loss function, defined as:\n\nL = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \quad (4)\n\nwhere N is the number of training examples, y_i is the wind speed label, and \hat{y}_i is the predicted wind speed label for the given training example. Mean-squared error was chosen over mean absolute error to more heavily penalize outliers, which are particularly undesirable in applications related to wind energy due to the cubic dependence of wind power on wind speed.\nStochastic gradient descent with momentum was used for optimization, with a typical momentum parameter of 0.9 [28]. The algorithm was implemented using the MATLAB Deep Learning Toolbox [17]. The LSTM network was trained for 20 epochs using minibatches of 256 samples. This amount of training was sufficient to over-fit to the training set within the limit of natural wind speed variability due to turbulence. Early stopping was employed for regularization. Computations were performed on a single CPU.\n\n5 Results and Discussion\n\nTable 2: Error metrics for validation and test cases. \u2018Overall RMSE\u2019 indicates the RMSE over all wind speeds, and \u2018Measurable Range RMSE\u2019 refers to the RMSE for wind speeds ranging from 0.75-11 m/s (described in Section 5.1.1).\n\nDataset | LSTM-raw Overall RMSE (m/s) | LSTM-raw Measurable Range RMSE (m/s) | LSTM-NM Overall RMSE (m/s) | LSTM-NM Measurable Range RMSE (m/s)\nFlag Validation Set | 1.37 | 1.27 | 1.42 | 1.37\nTree Validation Set | 1.53 | 1.29 | 1.63 | 1.47\nAdjacent Flag Test Set | 2.77 | 3.10 | 3.02 | 1.85\nTunnel Test Set | N/A | 9.61 | N/A | 1.82\n\n5.1 LSTM-NM Validation Results\n\nFigure 3a shows the mean wind speed predictions for 1 m/s bins plotted against the true 1-minute average wind speed labels for the validation set. The vertical error bars represent the standard deviation of predictions for each bin. 
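The training loss in Equation 4 and the RMSE metric reported in Table 2 share the same squared-error core; a minimal stdlib-Python sketch (our own function names, toy inputs):

```python
import math

def mse(y, y_hat):
    # Equation 4: mean of squared residuals over N examples.
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    # Metric reported in Table 2: square root of the mean-squared error.
    return math.sqrt(mse(y, y_hat))

print(rmse([2.0, 4.0], [3.0, 5.0]))  # 1.0 (both residuals are 1 m/s)
```

Because the residuals are squared before averaging, a single large outlier dominates the loss, which is the behavior the text cites when preferring MSE over mean absolute error.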
The horizontal error bars show the range of wind speeds captured by each bin for the field datasets, and the accuracy of the anemometer measurements for the tunnel set. The overall root-mean-squared errors (RMSE) for the flag and tree validation sets were 1.42 m/s and 1.63 m/s respectively, which approach the natural wind variability due to atmospheric turbulence, as will be discussed in Section 5.1.2. The mean prediction for each bin shows good agreement with the true labels, but the model tends to under-predict for high wind speeds (U > 11 m/s), and over-predict for low wind speeds (U < 2 m/s). Although some of the error at low wind speeds might be due to a lack of training examples in that range (see the training distribution in the Supplementary Materials document), as discussed in detail in the next section, we found that the reduced accuracy at the lowest and highest wind speeds could be predicted based on knowledge of the physics of the flow-structure interaction, as well as the video sample duration and temporal resolution.\n\n5.1.1 Measurement Limitations at High and Low Wind Speeds\n\nAt the lowest and highest wind speeds tested, the increased prediction error can be explained by an inability of the current dataset to capture the relevant physics necessary to measure the wind speed.\n\nFigure 3: Mean LSTM-NM model predictions as a function of the true wind speed label for (a) the validation set and (b) the test sets. A perfect model would follow a one-to-one ratio, indicated by the \u2018Unity\u2019 line overlaid on the plot (dashed black line). Vertical error bars indicate one standard deviation. Horizontal error bars indicate the range of wind speeds represented by a mark based on the binning for the field datasets, and the measurement uncertainty from the anemometer used in the tunnel test set. 
The wind speeds outside of the measurable range due to clip duration and frame rate are shown by the red and yellow shaded regions respectively.\n\nHere we will look more specifically at the flapping flag in the field for illustration. The pertinent physics are captured by a frequency scale of order f, where\n\nf = U/L \quad (5)\n\nis the frequency corresponding to a fluid element passing by the flag, L is the length of the flag, and U is the wind speed.\nAt high wind speeds the measurement capabilities are limited by the sampling rate, f_s. The Nyquist frequency, defined as f_{Nyquist} = 0.5 f_s, is the highest frequency that a signal can have and still be observed without the effects of aliasing. Using the Nyquist frequency as an upper bound for the characteristic frequency, the corresponding critical velocity, U_{c,high}, can be calculated as:\n\nU_{c,high} = L f_{Nyquist} \quad (6)\n\nIn this case, f_s = 15 Hz, given by the frame rate, yields a Nyquist frequency of f_{Nyquist} = 7.5 Hz. The length of the flag is fixed at L = 1.5 m. Applying these values to Equation 6 gives U_{c,high} \approx 11 m/s. For wind speeds exceeding this value, the characteristic frequency would not be measurable without aliasing. This appears to manifest as an under-prediction at high wind speed values in Figure 3a.\nAt low wind speeds, the duration of clips, T, is the limiting factor. The lowest frequency that can be fully observed is f = 1/T. The critical velocity is then given by:\n\nU_{c,low} = L/T \quad (7)\n\nGiven a clip length of 2 seconds, U_{c,low} = 0.75 m/s. Wind speeds lower than that would have fundamental frequencies that are too low to fully observe. Because of these known limitations for model performance at speeds under 0.75 m/s and above 11 m/s, the RMSE for the range 0.75-11 m/s (hereafter referred to as the measurable wind speed range) has been reported in addition to the overall RMSE. 
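The bounds from Equations 5-7 can be reproduced with a few lines of arithmetic for the field flag parameters quoted above (L = 1.5 m, 15 fps, 2 s clips). Variable names here are ours; note that the exact upper bound is 11.25 m/s, which the paper rounds to 11 m/s.

```python
# Measurable wind-speed range implied by Equations 5-7 for the field flag.
L = 1.5    # flag length, m
fs = 15.0  # video frame rate, Hz
T = 2.0    # clip duration, s

f_nyquist = 0.5 * fs      # 7.5 Hz: highest observable frequency
U_c_high = L * f_nyquist  # aliasing limit: 11.25 m/s (quoted as ~11 m/s)
U_c_low = L / T           # clip-duration limit: 0.75 m/s

print(U_c_low, U_c_high)  # 0.75 11.25
```

Shortening the flag, lowering the frame rate, or shortening the clips all narrow this range, which is consistent with the tunnel-flag behavior discussed in Section 5.2.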
For the flag validation set, the RMSE within this range was 1.37 m/s (results summarized in Table 2). The red and yellow shaded regions in Figure 3 indicate wind speeds outside of this measurable range due to duration and sampling rate respectively.\n\n5.1.2 Comparison to Turbulent Fluctuations\n\nIn evaluating model performance, it is important to consider the natural variation in the wind speed due to turbulence. Because of this variation, it is expected that the RMSE for the model predictions is at least as large as the standard deviation of turbulence fluctuations (denoted \sigma_u) calculated over the 1-minute averaging time. The fluctuating velocity, u', and \sigma_u are given in Equations 8 and 9 respectively:\n\nu' = u(t) - U \quad (8)\n\n\sigma_u = \sqrt{\overline{u'^2}} \quad (9)\n\nwhere u(t) is the instantaneous velocity, and U is the time-averaged velocity. To calculate \sigma_u for our field site, 1-minute average wind speed measurements were used for the mean velocity, U, and 2-second averages were used to represent the instantaneous velocity, u(t). The 2-second averaging time for the instantaneous measurements was chosen in order to match the duration of the video clips used as model inputs. Thus, each prediction from the network can be seen as a comparable instantaneous measurement, and in the ideal case, the standard deviation of the predicted values should match the standard deviation of the instantaneous anemometer measurements at each wind speed. To determine \sigma_u over the range of wind speeds, measurements of U were binned in 0.5 m/s increments, and the mean natural variability due to turbulence (\sigma_u) was calculated for each bin, represented by the gray band in Figure 3a.\nThis analysis allows for two comparisons of predictions to anemometer data. 
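The fluctuation statistic in Equations 8-9 can be sketched directly. This is an illustrative stdlib-Python estimator with our own function name and made-up sample values, standing in for the 2-second anemometer averages described above.

```python
import math

def sigma_u(u, U):
    # Equations 8-9: std. dev. of fluctuations u' = u(t) - U about the mean U.
    return math.sqrt(sum((ut - U) ** 2 for ut in u) / len(u))

# Hypothetical 2-second-average samples around a 1-minute mean of 5 m/s.
samples = [4.0, 5.0, 6.0, 5.0]
print(sigma_u(samples, 5.0))  # ~0.707
```

In the comparison above, this value plays the role of the gray band in Figure 3a: a model whose prediction scatter matches sigma_u is already at the accuracy floor set by turbulence.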
The first comparison is to the 1-minute average wind speeds that represent U at the site at a given time; the markers in Figure 3a lie close to the dashed black line representing unity, showing good agreement. The second comparison is between the standard deviation of the predictions and \sigma_u. The size of the error bars representing the standard deviation in predictions approaches the size of the gray band shown for \sigma_u. This result indicates that the model performance approaches the best possible accuracy given natural wind variability.\n\n5.2 LSTM-NM Test Results and Model Generalizability\n\nModel predictions for both the adjacent flag test set and the tunnel test set are plotted against true labels in Figure 3b. As discussed in Section 3, the adjacent flag test set serves as a direct comparison to the validation set. Although the model still captures the increasing trend, the under-predictions at high wind speeds and over-predictions at low wind speeds are more pronounced than they were for the validation set, visible in the flatter shape of the curve shown in Figure 3b.\nThe tunnel test set results are shown by the orange marks in Figure 3b. Similarly to the adjacent flag test set, the predictions for the tunnel test set capture the correct qualitative increasing trend, although the highest wind speed case appears to be under-predicted.\nThere are plausible explanations for why the test set predictions lie in a narrower range than the validation set predictions. For the tunnel test set, the flag length is shorter (0.37 m), which means the model may be limited by physics at even lower speeds (Section 5.1.1). The narrower range of predictions for the adjacent flag test set suggests that the model may be partially over-fit to the specific flag and tree it has been trained on (i.e. relying partly on specific features of those objects). 
The effect of over-fitting may become less significant if the training set were expanded to include a more diverse set of flags and trees.\nTest set predictions are still monotonically increasing with increasing true wind speed, suggesting that model capabilities are only partially limited by factors such as the frame rate or over-fitting. For both test sets, the RMSE in the measurable wind speed range was close to that of the validation set (Table 2). This indicates that the current model has potential to make predictions for flags other than the one it has been trained on, and flags that exist in new locations. This suggests the possibility for generalizability of this type of model in new settings, and its potential for broader use in mapping wind speeds.\n\n5.3 LSTM-raw Results and Effect of Mean Subtraction\n\nAs discussed in Section 4.2, the LSTM-raw experiment used the same model architecture and hyperparameters as the LSTM-NM experiment, but used raw inputs rather than temporally mean-subtracted inputs. The LSTM-raw model performed very similarly to the LSTM-NM model on the validation sets (Table 2). However, it performed notably worse on both test sets. For the adjacent flag test set, the LSTM-raw model gave an RMSE of 3.10 m/s in the measurable range compared to 1.85 m/s for the LSTM-NM model. For the tunnel test set, the LSTM-raw model gave an RMSE of 9.61 m/s, compared to 1.82 m/s for LSTM-NM. 
The decrease in performance on the test sets for the\nLSTM-raw model indicates that without a temporal mean subtraction, this model was unable to make\naccurate predictions for a \ufb02ag in a new location with a different background.\n\n6 Conclusion and Future Work\n\nHere, a coupled CNN and RNN using the ResNet-18 and LSTM architectures was trained to\nsuccessfully predict wind speeds within a range from 0.75-11 m/s using videos of \ufb02apping \ufb02ags and\nswaying trees, with prediction errors approaching the minimum expected error due to turbulence\n\ufb02uctuations on the validation set. Model performance on test sets consisting of new \ufb02ags in additional\nlocations suggests that such a model may generalize, and could therefore prove useful in measuring\nwind speeds in new environments. Using this data-driven approach to visual anemometry could offer\nsigni\ufb01cant bene\ufb01ts in applications such as mapping complex wind \ufb01elds in urban environments, as it\ncould cut down on the time and cost required to measure wind speeds at several locations, which is\ncurrently done by installing an instrument at each point of interest.\nAlthough this study focused speci\ufb01cally on video clips of checkered \ufb02ags and a magnolia tree, this\napproach to wind speed measurement can potentially generalize to other types of objects, including\nother types of \ufb02ags and natural vegetation that interact with the surrounding wind. In addition to\ntraining on a broader dataset including a variety of objects, con\ufb01dence in the potential for model\ngeneralization can be further improved given a deeper understanding of which relevant physics the\nmodel is using for prediction. Here we showed how the measurement capabilities of a model were\nlimited by the fundamental frequency of the \ufb02ag. 
Future work will focus on understanding which physics of the fluid-structure interactions are extracted by the model and are necessary for accurate predictions.

7 Acknowledgements

The authors acknowledge Kelyn Wood, who assisted in the setup for the wind tunnel test set. J.L.C. is funded through the Brit and Alex d'Arbeloff Stanford Graduate Fellowship, and M.F.H. is funded through a National Science Foundation Graduate Research Fellowship under Grant DGE-1656518 and a Stanford Graduate Fellowship.