{"title": "Learning a Warping Distance from Unlabeled Time Series Using Sequence Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 10547, "page_last": 10555, "abstract": "Measuring similarities between unlabeled time series trajectories is an important problem in many domains such as medicine, economics, and vision. It is often unclear what is the appropriate metric to use because of the complex nature of noise in the trajectories (e.g. different sampling rates or outliers). Experts typically hand-craft or manually select a specific metric, such as Dynamic Time Warping (DTW), to apply on their data. In this paper, we propose an end-to-end framework, autowarp, that optimizes and learns a good metric given unlabeled trajectories. We define a flexible and differentiable family of warping metrics, which encompasses common metrics such as DTW, Edit Distance, Euclidean, etc. Autowarp then leverages the representation power of sequence autoencoders to optimize for a member of this warping family. The output is an metric which is easy to interpret and can be robustly learned from relatively few trajectories. In systematic experiments across different domains, we show that autowarp often outperforms hand-crafted trajectory similarity metrics.", "full_text": "Autowarp: Learning a Warping Distance from\n\nUnlabeled Time Series Using Sequence Autoencoders\n\nAbubakar Abid\nStanford University\na12d@stanford.edu\n\nAbstract\n\nJames Zou\n\nStanford University\n\njamesz@stanford.edu\n\nMeasuring similarities between unlabeled time series trajectories is an important\nproblem in domains as diverse as medicine, astronomy, \ufb01nance, and computer\nvision. It is often unclear what is the appropriate metric to use because of the\ncomplex nature of noise in the trajectories (e.g. 
different sampling rates or outliers). Domain experts typically hand-craft or manually select a specific metric, such as dynamic time warping (DTW), to apply to their data. In this paper, we propose Autowarp, an end-to-end algorithm that optimizes and learns a good metric given unlabeled trajectories. We define a flexible and differentiable family of warping metrics, which encompasses common metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation power of sequence autoencoders to optimize for a member of this warping distance family. The output is a metric which is easy to interpret and can be robustly learned from relatively few trajectories. In systematic experiments across different domains, we show that Autowarp often outperforms hand-crafted trajectory similarity metrics.\n\n1 Introduction\n\nA time series, also known as a trajectory, is a sequence of observed data t = (t1, t2, . . . , tn) measured over time. Many real-world datasets in medicine [1], finance [2], astronomy [3], and computer vision [4] are time series. A key question that is often asked about time series data is: \"How similar are two given trajectories?\" A notion of trajectory similarity allows one to do unsupervised learning, such as clustering and visualization, of time series data, as well as supervised learning, such as classification [5]. However, measuring the distance between trajectories is complex, because of the temporal correlation between data in a time series and the complex nature of the noise that may be present in the data (e.g. different sampling rates) [6].\n\nIn the literature, many methods have been proposed to measure the similarity between trajectories. In the simplest case, when trajectories are all sampled at the same frequency and are of equal length, Euclidean distance can be used [7]. 
When comparing trajectories with different sampling rates,\ndynamic time warping (DTW) is a popular choice [7]. Because the choice of distance metric can\nhave a signi\ufb01cant effect on downstream analysis [5, 6, 8], a plethora of other distances have been\nhand-crafted based on the speci\ufb01c characteristics of the data and noise present in the time series.\nHowever, a review of \ufb01ve of the most popular trajectory distances found that no one trajectory\ndistance is more robust than the others to all of the different kinds of noise that are commonly present\nin time series data [8]. As a result, it is perhaps not surprising that many distances have been manually\ndesigned for different time series domains and datasets. In this work, we propose an alternative to\nhand-crafting a distance: we develop an end-to-end framework to learn a good similarity metric\ndirectly from unlabeled time series data.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Learning a Distance with Autowarp. Here we visualize the stages of Autowarp by using\nmulti-dimensional scaling (MDS) to embed a set of 50 trajectories into two dimensions at each step\nof the algorithm. Each dot represents one observed trajectory that is generated by adding Gaussian\nnoise and outliers to 10 copies of 5 seed trajectories (each color represents a seed). (left) First, we\nrun MDS on the original trajectories with Euclidean distance. (center) Next, we run MDS on the\nlatent representations learned with a sequence-to-sequence autoencoder, which partially resolves the\noriginal clusters. (right) Finally, we run MDS on the original trajectories using the learned warping\ndistance, which completely resolves the original clusters.\n\nWhile data-dependent analysis of time-series is commonly performed in the context of supervised\nlearning (e.g. 
using RNNs or convolutional networks to classify trajectories [9]), this is not often performed when the time series are unlabeled, as it is more challenging to determine notions of similarity in the absence of labels. Yet the unsupervised regime is critical: in many time series datasets, ground-truth labels are difficult to obtain, while the notion of similarity still plays a key role. For example, consider a set of disease trajectories recorded in a large electronic health records database: we have the time series of the diseases contracted by each patient, and it may be important to determine which patient in our dataset is most similar to another patient based on his or her disease trajectory. Yet the choice of ground-truth labels is ambiguous in this case. In this work, we develop an easy-to-use method to determine a distance that is appropriate for a given set of unlabeled trajectories.\n\nIn this paper, we restrict ourselves to the family of trajectory distances known as warping distances (formally defined in Section 2). This is for several reasons. First, warping distances have been widely studied, and are intuitive and interpretable [7]. Second, they are efficient to compute, and numerous heuristics have been developed to allow nearest-neighbor queries on datasets with as many as trillions of trajectories [10]. Third, although they are a flexible and general class, warping distances are particularly well-suited to trajectories, and serve as a means of regularizing the unsupervised learning of similarity metrics directly from trajectory data. 
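For concreteness, the dynamic time warping distance mentioned above can be computed with the standard textbook recurrence; the sketch below is a generic NumPy implementation, not code from this paper:

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic time warping via the standard O(n*m) recurrence.

    a, b: arrays of shape (n, d) and (m, d) -- trajectories that may have
    different lengths, which is exactly where plain Euclidean distance fails.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # D[i, j]: best alignment cost so far
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

For example, a trajectory and a resampled copy of it (each point repeated) are at DTW distance 0, even though their lengths differ and plain Euclidean distance between them is undefined.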
We show through systematic experiments that learning an appropriate warping distance can provide insight into the nature of the time series data, and can be used to cluster, query, or visualize the data effectively.\n\nRelated Work The development of distance metrics for time series stretches at least as far back as the introduction of dynamic time warping (DTW) for speech recognition [11]. Limitations of DTW led to the development and adoption of the Edit Distance on Real Sequence (EDR) [12], the Edit Distance with Real Penalty (ERP) [13], and the Longest Common Subsequence (LCSS) [14] as alternative distances. Many variants of these distances have been proposed, based on characteristics specific to certain domains and datasets, such as the Symmetric Segment-Path Distance (SSPD) [7] for GPS trajectories and Subsequence Matching [15] for medical time series data, among others [16].\n\nPrior work in metric learning from trajectories is generally limited to the supervised regime. For example, in recent years, convolutional neural networks [9], recurrent neural networks (RNNs) [17], and siamese recurrent neural networks [18] have been proposed to classify time series based on labeled training sets. There has also been some work in applying unsupervised deep learning to time series [19]. For example, the authors of [20] use a pre-trained RNN to extract features from time series that are useful for downstream classification. Unsupervised RNNs have also found use in anomaly detection [21] and forecasting [22] of time series.\n\n2 The Autowarp Approach\n\nOur approach, which we call Autowarp, consists of two steps. First, we learn a latent representation for each trajectory using a sequence-to-sequence autoencoder. This representation takes advantage of the temporal correlations present in time series data to learn a low-dimensional representation of each trajectory. 
In the second stage, we search over a family of warping distances to identify the warping distance that, when applied to the original trajectories, is most similar to the Euclidean distances between the latent representations. Fig. 1 shows the application of Autowarp to synthetic data.\n\nLearning a latent representation Autowarp first learns a latent representation that captures the significant properties of trajectories in an unsupervised manner. In many domains, an effective latent representation can be learned by using autoencoders that reconstruct the input data from a low-dimensional representation. We take the same approach, using sequence-to-sequence autoencoders. This architecture has been successfully applied to sentiment classification [23], machine translation [24], and learning representations of videos [25]. In the architecture that we use (illustrated in Fig. 2), we feed each step in the trajectory sequentially into an encoding LSTM layer. The hidden state of the final LSTM cell is then fed identically into a decoding LSTM layer, which contains as many cells as the length of the original trajectory. This layer attempts to reconstruct each trajectory based solely on the learned latent representation for that trajectory.\n\nWhat kind of features are learned in the latent representation? Generally, the hidden representation captures overarching features of the trajectory, while learning to ignore outliers and sampling rate. We illustrate this in Fig. 
S1 in Appendix A: the LSTM autoencoders learn to denoise representations of trajectories that have been sampled at different rates, or in which outliers have been introduced.\n\nFigure 2: Schematic for LSTM Sequence-to-Sequence Autoencoder. We learn a latent representation for each trajectory by passing it through a sequence-to-sequence autoencoder that is trained to minimize the reconstruction loss ||t − t̃||² between the original trajectory t and the decoded trajectory t̃. In the decoding stage, the latent representation h is passed as input into each LSTM cell.\n\nWarping distances Once a latent representation is learned, we search over a family of warping distances to find the warping distance that, applied to the original trajectories, best mimics the distances between the trajectories' latent representations. This can be seen as \"distilling\" the representation learned by the neural network into a warping distance (e.g. see [26]). In addition, as warping distances are generally well-suited to trajectories, this serves to regularize the process of distance metric learning, and generally produces better distances than using the latent representations directly (as illustrated in Fig. 1).\n\nWe proceed to formally define a warping distance, as well as the family of warping distances that we work with for the rest of the paper. First, we define a warping path between two trajectories.\n\nDefinition 1. A warping path p = (p0, . . . , pL) between two trajectories tA = (tA_1, . . . , tA_n) and tB = (tB_1, . . . , tB_m) is a sequence of pairs of trajectory states, where the first state comes from trajectory tA or is null (which we will denote as tA_0), and the second state comes from trajectory tB or is null (which we will denote as tB_0). Furthermore, p must satisfy two properties:\n\n• boundary conditions: p0 = (tA_0, tB_0) and pL = (tA_n, tB_m)\n• valid steps: pk = (tA_i, tB_j) =⇒ pk+1 ∈ {(tA_i+1, tB_j), (tA_i+1, tB_j+1), (tA_i, tB_j+1)}.\n\nWarping paths can be seen as traversals on an (n + 1)-by-(m + 1) grid from the bottom left to the top right, where one is allowed to go up one step, right one step, or one step up and right, as shown in Fig. S2 in Appendix A. We shall refer to these as vertical, horizontal, and diagonal steps respectively.\n\nDefinition 2. Given a set of trajectories T, a warping distance d is a function that maps each pair of trajectories in T to a real number in [0, ∞). A warping distance is completely specified in terms of a cost function c(·, ·) on two pairs of trajectory states: let tA, tB ∈ T. Then d(tA, tB) is defined¹ as d(tA, tB) = min_{p ∈ P} Σ_{i=1}^{L} c(p_{i−1}, p_i), where P is the set of all warping paths between tA and tB. The function c(p_{i−1}, p_i) represents the cost of taking the step from p_{i−1} to p_i, and, in general, differs for horizontal, vertical, and diagonal steps.\n\nThus, a warping distance represents a particular optimization carried out over all valid warping paths between two trajectories. 
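Definition 2 translates directly into a dynamic program over the (n + 1)-by-(m + 1) grid. The sketch below uses a hypothetical `step_cost(kind, x, y)` interface for the cost function c, where `kind` is 'diag', 'vert', or 'horiz' and x, y are the states the step lands on (None for the null boundary states); it is an illustration under those assumptions, not the paper's code:

```python
import numpy as np

def warping_distance(tA, tB, step_cost):
    """Dynamic-programming sketch of Definition 2: minimize the summed step
    costs over all warping paths on the (n+1)-by-(m+1) grid."""
    n, m = len(tA), len(tB)
    D = np.full((n + 1, m + 1), np.inf)  # D[i, j]: cheapest path reaching cell (i, j)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            x = tA[i - 1] if i > 0 else None  # landing state from tA (None = null)
            y = tB[j - 1] if j > 0 else None  # landing state from tB (None = null)
            if i > 0 and j > 0:
                D[i, j] = min(D[i, j], D[i - 1, j - 1] + step_cost('diag', x, y))
            if i > 0:
                D[i, j] = min(D[i, j], D[i - 1, j] + step_cost('vert', x, y))
            if j > 0:
                D[i, j] = min(D[i, j], D[i, j - 1] + step_cost('horiz', x, y))
    return D[n, m]

def dtw_like_cost(kind, x, y):
    # Every step pays the distance between the states it aligns; steps that
    # land on a null boundary state are disallowed (a simplifying assumption
    # of this sketch).
    if x is None or y is None:
        return np.inf
    return abs(x - y)
```

With `dtw_like_cost`, `warping_distance([0, 1, 2], [0, 0, 1, 1, 2, 2], dtw_like_cost)` returns 0.0: the repeated samples are absorbed by horizontal steps of zero cost.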
In this paper, we define a family of warping distances D, with the following parametrization of c(·, ·):\n\nc((tA_i, tB_j), (tA_i′, tB_j′)) = σ(||tA_i′ − tB_j′||, ε/(1−ε)) if i′ > i and j′ > j, and\nc((tA_i, tB_j), (tA_i′, tB_j′)) = (α/(1−α)) · σ(||tA_i′ − tB_j′||, ε/(1−ε)) + γ if i′ = i or j′ = j. (1)\n\nHere, we define σ(x, y) := y · tanh(x/y) to be a soft thresholding function, such that σ(x, y) ≈ x if 0 ≤ x ≤ y and σ(x, y) ≈ y if x > y. And, σ(x, ∞) := x. The family of distances D is parametrized by three parameters α, γ, ε. With this parametrization, D includes several commonly used warping distances for trajectories, as shown in Table 1, as well as many other warping distances.\n\nTrajectory Distance | α | γ | ε\nEuclidean(a) | 1 | 0 | 1\nDynamic Time Warping (DTW) [11] | 0.5 | 0 | 1\nEdit Distance (γ0) [13] | 0 | 0 < γ0 | 1\nEdit Distance on Real Sequences (γ0, ε0) [12](b) | 0 | 0 < γ0 | 0 < ε0 < 1\n\nTable 1: Parametrization of common trajectory dissimilarities.\n(a) The Euclidean distance between two trajectories is infinite if they are of different lengths.\n(b) This is actually a smooth, differentiable approximation to EDR.\n\nOptimizing warping distance using betaCV Within our family of warping distances, how do we choose the one that aligns most closely with the learned latent representation? To allow a comparison between latent representations and trajectory distances, we use the concept of betaCV:\n\nDefinition 3. Given a set of trajectories T = {t1, t2, . . . 
tT}, a trajectory metric d and an assignment to clusters C(i) for each trajectory ti, the betaCV, denoted as β, is defined as:\n\nβ(d) = [(1/Z) Σ_{i=1}^T Σ_{j=1}^T d(ti, tj) 1[C(i) = C(j)]] / [(1/T²) Σ_{i=1}^T Σ_{j=1}^T d(ti, tj)], (2)\n\nwhere Z = Σ_{i=1}^T Σ_{j=1}^T 1[C(i) = C(j)] is the normalization constant needed to transform the numerator into an average of distances.\n\nIn the literature, the betaCV is used to evaluate different clustering assignments C for a fixed distance [27]. In our work, it is the distance d that is not known; were true cluster assignments known, the betaCV would be a natural quantity to minimize over the distances in D, as it would give us a distance metric that minimizes the average distance of trajectories to other trajectories within the same cluster (normalized by the average distance across all pairs of trajectories).\n\nHowever, as the clustering assignments are not known, we instead use the Euclidean distances between the latent representations of two trajectories to determine whether they belong to the same \"cluster.\" In particular, we designate two trajectories as belonging to the same cluster if the distance between their latent representations is less than a threshold δ, which is chosen as a percentile ¯p of the distribution of distances between all pairs of latent representations.\n\n¹A more general definition of warping distance replaces the summation over c(p_{i−1}, p_i) with a general class of statistics, which may include max and min, for example. For simplicity, we present the narrower definition here.\n\nWe will denote this version of the betaCV, calculated based on the latent representations learned by an autoencoder, as ˆβ(d):\n\nDefinition 4. Given a set of trajectories T = {t1, t2, . . . tT}, a metric d and a latent representation hi for each trajectory ti, the latent betaCV, denoted as ˆβ, is defined as:\n\nˆβ(d) = [(1/Z) Σ_{i=1}^T Σ_{j=1}^T d(ti, tj) 1[||hi − hj||_2 < δ]] / [(1/T²) Σ_{i=1}^T Σ_{j=1}^T d(ti, tj)], (3)\n\nwhere Z is a normalization constant defined analogously as in (2). The threshold distance δ is a hyperparameter for the algorithm, generally set to be a certain threshold percentile (¯p) of all pairwise distances between latent representations.\n\nWith this definition in hand, we are ready to specify how we choose a warping distance based on the latent representations. We choose the warping distance that gives us the lowest latent betaCV:\n\nˆd = arg min_{d ∈ D} ˆβ(d).\n\nWe have seen that the learned representations hi are not always able to remove the noise present in the observed trajectories. It is natural to ask, then, whether it is a good idea to calculate the betaCV using the noisy latent representations in place of true clustering assignments. In other words, suppose we computed β based on known cluster assignments in a trajectory dataset. If we then computed ˆβ based on somewhat noisy learned latent representations, could it be that β and ˆβ differ markedly? In Appendix C, we carry out a theoretical analysis, assuming that the computation of ˆβ is based on a noisy clustering ˜C. We present the conclusion of that analysis here:\n\nProposition 1 (Robustness of Latent BetaCV). Let d be a trajectory distance defined over a set of trajectories T of cardinality T. Let β(d) be the betaCV computed on the set of trajectories using the true cluster labels {C(i)}. 
Let ˆβ(d) be the betaCV computed on the set of trajectories using noisy cluster labels { ˜C(i)}, which are generated by independently randomly reassigning each C(i) with probability p. For a constant K that depends on the distribution of the trajectories, the probability that the latent betaCV changes by more than x beyond the expected Kp is bounded by:\n\nPr(|β − ˆβ| > Kp + x) ≤ e^(−2T x²/K²). (4)\n\nThis result suggests that a latent betaCV computed based on latent representations may still be a reliable metric even when the latent representations are somewhat noisy. In practice, we find that the quality of the autoencoder does have an effect on the quality of the learned warping distance, up to a certain extent. We quantify this behavior using an experiment shown in Fig. S4 in Appendix A.\n\n3 Efficiently Implementing Autowarp\n\nThere are two computational challenges in finding an appropriate warping distance: efficiently searching through the continuous space of warping distances, and computing gradients over large sets of trajectories. In this section, we show that the computation of the betaCV over the family of warping distances defined above is differentiable with respect to the quantities α, γ, ε that parametrize the family. Because computing gradients over the whole set of trajectories is still computationally expensive for many real-world datasets, we introduce a method of sampling trajectories that provides significant speed gains. The formal outline of Autowarp is in Appendix B.\n\nDifferentiability of betaCV. In Section 2, we proposed that a warping distance can be identified as the distance d ∈ D that minimizes the betaCV computed from the latent representations. Since D contains infinitely many distances, we cannot evaluate the betaCV for each distance one by one. Rather, we solve this optimization problem using gradient descent. 
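To make the pieces concrete, the sketch below implements one member of the warping family, the latent betaCV, and a single descent step. It is a simplified illustration, not the paper's implementation: the step costs follow our reading of Eq. (1) with null boundary states disallowed, and the gradient is taken by finite differences rather than the analytic gradient derived in Appendix C:

```python
import numpy as np

def sigma(x, y):
    # Soft threshold: sigma(x, y) = y * tanh(x / y), with sigma(x, inf) = x.
    return x if np.isinf(y) else y * np.tanh(x / y)

def warp_dist(tA, tB, alpha, gamma, eps):
    # Warping distance for one member of the (alpha, gamma, eps) family,
    # computed by dynamic programming.  Steps landing on null boundary
    # states are disallowed here -- a simplifying assumption of this sketch.
    thr = np.inf if eps >= 1.0 else eps / (1.0 - eps)
    n, m = len(tA), len(tB)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sigma(np.linalg.norm(tA[i - 1] - tB[j - 1]), thr)
            if alpha >= 1.0:
                gap = np.inf  # non-diagonal steps forbidden (Euclidean case)
            else:
                gap = (alpha / (1.0 - alpha)) * d + gamma
            D[i, j] = min(D[i - 1, j - 1] + d,
                          min(D[i - 1, j], D[i, j - 1]) + gap)
    return D[n, m]

def latent_betacv(trajs, H, alpha, gamma, eps, pbar=0.2):
    # Latent betaCV (Definition 4): mean warping distance over pairs whose
    # latent representations fall within the pbar-quantile threshold delta,
    # divided by the mean warping distance over all pairs.
    T = len(trajs)
    hd = np.array([[np.linalg.norm(H[i] - H[j]) for j in range(T)] for i in range(T)])
    delta = np.percentile(hd, 100 * pbar)
    d = np.array([[warp_dist(trajs[i], trajs[j], alpha, gamma, eps)
                   for j in range(T)] for i in range(T)])
    return d[hd < delta].mean() / d.mean()

def fd_grad_step(trajs, H, theta, lr=0.05, h=1e-3, pbar=0.2):
    # One gradient-descent step on theta = (alpha, gamma, eps) via central
    # finite differences, used here only to keep the sketch short.
    grad = np.zeros(3)
    for k in range(3):
        tp, tm = theta.copy(), theta.copy()
        tp[k] += h
        tm[k] -= h
        grad[k] = (latent_betacv(trajs, H, *tp, pbar=pbar)
                   - latent_betacv(trajs, H, *tm, pbar=pbar)) / (2 * h)
    return np.clip(theta - lr * grad, 1e-3, 1.0 - 1e-3)  # keep parameters in (0, 1)
```

Setting (alpha, gamma, eps) = (0.5, 0, 1) recovers a DTW-style distance, while alpha = 1 makes the distance infinite for trajectories of different lengths, matching the Euclidean row of Table 1.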
In Appendix C, we prove that the betaCV is differentiable with respect to the parameters α, γ, ε and that the gradient can be computed in O(T²N²) time, where T is the number of trajectories in our dataset and N is the number of elements in each trajectory (see Proposition 2).\n\nFigure 3: Validating Latent BetaCV. We construct a synthetic time series dataset with Gaussian noise and outliers added to the trajectories. We compute the latent betaCV for various distances (left), which closely matches the plot of the true betaCV (middle) computed based on knowledge of the seed clusters. As a control, we plot the betaCV computed based on the original trajectories (right). Black dots represent the optimal value of α and γ in each plot. Lower betaCV is better.\n\nBatched gradient descent. When the size of the dataset becomes modestly large, it is no longer feasible to re-compute the exact analytical gradient at each step of gradient descent. Instead, we take inspiration from negative sampling in word embeddings [28], and only sample a fixed number, S, of pairs of trajectories at each step of gradient descent. This reduces the runtime of each step of gradient descent to O(SN²), where S ≈ 32–128 in our experiments. Instead of full gradient descent, this effectively becomes batched gradient descent. The complete algorithm for batched Autowarp is shown in Algorithm 1 in Appendix B. Because the betaCV is not convex in the parameters α, γ, ε, we usually repeat Algorithm 1 with multiple initializations and choose the parameters that produce the lowest betaCV.\n\n4 Validation\n\nRecall that Autowarp learns a distance from unlabeled trajectory data in two steps: first, a latent representation is learned for each trajectory; second, a warping distance is identified that is most similar to the learned latent representations. 
In this section, we empirically validate this methodology.\n\nValidating latent betaCV. We generate synthetic trajectories that are copies of a seed trajectory with different kinds of noise added to each trajectory. We then measure ˆβh for a large number of distances in D. Because these are synthetic trajectories, we compare this to the true β measured using the known cluster labels (each seed generates one cluster). As a control, we also consider computing the betaCV based on the Euclidean distance between the original trajectories, rather than the Euclidean distance between the latent representations; we denote this quantity as ˆβt. Fig. 3 shows the plots when the noise takes the form of adding outliers and Gaussian noise to the data. The betaCVs are plotted for distances d with different values of α and γ, with ε = 1. Plots for other kinds of noise are included in Appendix A (see Fig. S7). These plots suggest that ˆβh assigns each distance a betaCV that is representative of the true clustering labels. Furthermore, we find that the distances with the lowest betaCV in each case concur with previous studies of the robustness of different trajectory distances. For example, we find that DTW (α = 0.5, γ = 0) is the appropriate distance metric for resampled trajectories, Euclidean (α = 1, γ = 0) for Gaussian noise, and edit distance (α = 0, γ ≈ 0.4) for trajectories with outliers.\n\nAblation and sensitivity analysis. Next, we investigate the sensitivity of the latent betaCV calculation to the hyperparameters of the algorithm. 
We find that although the betaCV changes as the threshold changes, the relative ordering of different warping distances mostly remains the same. Similarly, we find that the dimension of the hidden layer in the autoencoder can vary widely without significantly affecting qualitative results (see Fig. 4). For a variety of experiments, we find that a reasonable number of latent dimensions is ≈ L · D, where L is the average trajectory length and D the dimensionality. We also investigate whether both the autoencoder and the search through warping distances are necessary for effective metric learning. Our results indicate that both are needed: using the latent representations alone results in noisy clustering, while the warping distance search cannot be applied in the original trajectory space to get meaningful results (Fig. 1).\n\nFigure 4: Sensitivity Analysis on Trajectories with Outliers. (a) We investigate how the percentile threshold parameter affects latent betaCV. (b) We also investigate the effect of changing the latent dimensionality on the relative ordering of the distances. We find that the qualitative ranking of different distances is generally robust to the choice of these hyperparameters.\n\nDownstream classification. A key motivation of distance metric learning is the ability to perform downstream classification and clustering tasks more effectively. We validated this on a real dataset: the Libras dataset, which consists of coordinates of users performing Brazilian sign language. The x- and y-coordinates of the positions of the subjects' hands are recorded, as well as the symbol that the users are communicating, providing us labels to evaluate our distance metrics.\n\nFor this experiment, we chose a subset of 40 trajectories from 5 different categories. For a given distance function d, we iterated over every trajectory and computed the 7 closest trajectories to it (as there are a total of 8 trajectories from each category). We computed which fraction of the 7 shared their label with the original trajectory. A good distance should provide us with a higher fraction. We evaluated 50 distances: 42 of them were chosen randomly, 4 were well-known warping distances, and 4 were the result of performing Algorithm 1 from different initializations. We measured both the betaCV of each distance, as well as the accuracy. The results are shown in Fig. 5, which shows a clear negative correlation (rank correlation = 0.85) between betaCV and label accuracy.\n\nFigure 5: Latent BetaCV and Downstream Classification. 
Here, we choose 50 warping distances and plot the latent betaCV of each one on the Libras dataset, along with the average classification accuracy when each trajectory is used to classify its nearest neighbors. Results suggest that minimizing latent betaCV provides a suitable distance for downstream classification.\n\n5 Autowarp Applied to Real Datasets\n\nMany of the hand-crafted distances mentioned earlier in the manuscript were developed for and tested on particular time series datasets. We now turn to two such public datasets, and demonstrate how Autowarp can be used to learn an appropriate warping distance from the data. We show that the warping distance that we learn is competitive with the original hand-crafted distances.\n\nTaxicab Mobility Dataset. We first turn to a dataset that consists of GPS measurements from 536 San Francisco taxis over a 24-day period². This dataset was used to test the SSPD distance metric for\n\n²Data can be downloaded from https://crawdad.org/epfl/mobility/20090224/cab/.\n\nFigure 6: Taxicab Mobility Dataset. (a) We plot the trajectories, along with their start and end points. (b) We evaluate the average normalized distance to various numbers of neighbors for five different trajectory distances, and find that the Autowarp distance (black line) produces the most compact clusters. (c) We apply spectral clustering with 5 different clusters (each color 
represents a different cluster) using the Autowarp learned distance.\n\ntrajectories [7]. Like the authors of [7], we preprocessed the dataset to include only those trajectories that begin when a taxicab has picked up a passenger at the Caltrain station in San Francisco, and whose drop-off location is in downtown San Francisco. This leaves us with T = 500 trajectories, with a median length of N = 9. Each trajectory is 2-dimensional, consisting of an x- and y-coordinate. The trajectories are plotted in Fig. 6(a). We used Autowarp (Algorithm 1 with hyperparameters dh = 10, S = 64, ¯p = 1/5) to learn a warping distance from the data (α = 0.88, γ = 0, ε = 0.33). This distance is similar to Euclidean distance; this may be because the GPS timestamps are regularly sampled. The small value of ε suggests that some thresholding is needed for an optimal distance, possibly because of irregular stops or routes taken by the taxis.\n\nThe trajectories in this dataset are not labeled, so to evaluate the quality of our learned distance, we compute the average distance of each trajectory to its k closest neighbors, normalized. This is analogous to how the original authors evaluated their algorithm: the lower the normalized distance, the more \"compact\" the clusters. We show the results in Fig. 6(b) for various values of k, showing that the learned distance is as compact as SSPD, if not more compact. We also visualize the results when our learned distance metric is used to cluster the trajectories into 5 clusters using spectral clustering in Fig. 6(c).\n\nAustralian Sign Language (ASL) Dataset. Next, we turn to a dataset that consists of measurements taken from a smart glove worn by a sign linguist³. This dataset was used to test the EDR distance metric [12]. Like the original authors, we chose a subset consisting of T = 50 trajectories, of median length N = 53. This subset included 10 different classes of signals. 
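The neighbor-compactness evaluation used for the taxicab data above can be sketched as follows; `knn_compactness` is a hypothetical helper name, and normalizing by the mean pairwise distance is one plausible reading of "normalized":

```python
import numpy as np

def knn_compactness(D, k):
    # Average distance from each trajectory to its k nearest neighbors,
    # normalized by the mean off-diagonal pairwise distance (lower means
    # more compact clusters).
    D = np.asarray(D, dtype=float)            # symmetric (T, T), zero diagonal
    T = len(D)
    nearest = np.sort(D, axis=1)[:, 1:k + 1]  # drop the self-distance at column 0
    off_diag = D[~np.eye(T, dtype=bool)]
    return nearest.mean() / off_diag.mean()
```

For a distance matrix with tight, well-separated clusters the score is well below 1, and it approaches 1 as k grows toward the full dataset.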
The measurements of the glove are 4-dimensional, including x-, y-, and z-coordinates, along with the rotation of the palm. We used Autowarp (Algorithm 1 with hyperparameters dh = 20, S = 32, p̄ = 1/5) to learn a warping distance from the data (learned distance: α = 0.29, γ = 0.22, ε = 0.48). The trajectories in this dataset are labeled, so to evaluate the quality of our learned distance, we computed the accuracy of nearest-neighbor classification on the data. Most distance functions achieve a reasonably high accuracy on this task, so like the authors of [12], we added various sources of noise to the data. We evaluated the learned distance, as well as the original distance metric, on the noisy datasets, and found that the learned distance is significantly more robust than EDR, particularly when multiple sources of noise are simultaneously added, denoted as “hybrid” noises in Fig. 7.

³Data can be downloaded from http://kdd.ics.uci.edu/databases/auslan/auslan.data.html.

Figure 7: ASL Dataset. We use various distance metrics to perform nearest-neighbor classification on the ASL dataset. The original ASL dataset is shown on the left, and various synthetic noises have been added to generate the results on the right. ‘Hybrid1’ is a combination of Gaussian noise and outliers, while ‘Hybrid2’ refers to a combination of Gaussian and sampling noise.

6 Discussion

In this paper, we propose Autowarp, a novel method to learn a similarity metric from a dataset of unlabeled trajectories. Our method learns a warping distance that agrees with the latent representations learned for the trajectories by a sequence-to-sequence autoencoder. We show through systematic experiments that learning an appropriate warping distance can provide insight into the nature of the time series data, and can be used to cluster, query, or visualize the data effectively.

Our experiments suggest that both steps of Autowarp – first, learning latent representations using sequence-to-sequence autoencoders, and second, finding a warping distance that agrees with the latent representation – are important to learning a good similarity metric. In particular, we carried out experiments with deeper autoencoders to determine whether increasing the capacity of the autoencoders would allow the autoencoder alone to learn a similarity metric. Our results, some of which are shown in Figure S5 in Appendix A, show that even deeper autoencoders are unable to learn useful similarity metrics without the regularization afforded by restricting ourselves to a family of warping distances.

Autowarp can be implemented efficiently because we have defined a differentiable, parametrized family of warping distances over which it is possible to do batched gradient descent.
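As a concrete reference point for the O(N²) cost per trajectory pair, classic DTW, one member of this warping family, can be sketched as the standard dynamic program below. This is textbook DTW, not the paper's exact parameterized recursion (which additionally involves α, γ, and ε):

```python
import numpy as np

def dtw(x, y):
    """Classic dynamic time warping between two trajectories.

    x, y: sequences of d-dimensional points. Fills an (n+1) x (m+1)
    cumulative-cost grid, hence the O(N^2) time per pair.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance between the two points.
            d = np.linalg.norm(np.asarray(x[i - 1], dtype=float)
                               - np.asarray(y[j - 1], dtype=float))
            cost[i, j] = d + min(cost[i - 1, j],      # repeat a point of y
                                 cost[i, j - 1],      # repeat a point of x
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]
```

Because the grid has N² entries with constant work each, a batch of S trajectory pairs costs O(SN²) per gradient step, as noted below.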
Each step of batched gradient descent can be computed in time O(SN²), where S is the batch size, and N is the number of elements in a given trajectory. There are further possible improvements in speed, for example, by leveraging techniques similar to FastDTW [29], which can approximate any warping distance in linear time, bringing the run-time of each step of batched gradient descent to O(SN).

Across different datasets and noise settings, Autowarp is able to perform as well as, and often better than, the hand-crafted similarity metric designed specifically for the dataset and noise. For example, in Figure 6, we note that the Autowarp distance performs almost as well as, and in certain settings even better than, the SSPD metric on the Taxicab Mobility Dataset, for which the SSPD metric was specifically crafted. Similarly, in Figure 7, we show that the Autowarp distance outperforms most other distances on the ASL dataset, including the EDR distance, which was validated on this dataset. These results confirm that Autowarp can learn useful distances without prior knowledge of labels or clusters within the data. Future work will extend these results to more challenging time series data, such as those with higher dimensionality or heterogeneous data.

Acknowledgments

We are grateful to many people for providing helpful suggestions and comments in the preparation of this manuscript. Brainstorming discussions with Ali Abdalla provided the initial sparks that led to the Autowarp algorithm, and discussions with Ali Abid were instrumental in ensuring that the formulation of the algorithm was clear and rigorous.
Feedback from Amirata Ghorbani, Jaime Gimenez, Ruishan Liu, and Amirali Aghazadeh was invaluable in guiding the experiments and analyses that were carried out for this paper.