{"title": "Reducing statistical time-series problems to binary classification", "book": "Advances in Neural Information Processing Systems", "page_first": 2060, "page_last": 2068, "abstract": "We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data.", "full_text": "Reducing statistical time-series problems to binary\n\nclassi\ufb01cation\n\nDaniil Ryabko\n\nSequeL-INRIA/LIFL-CNRS,\nUniversit\u00b4e de Lille, France\ndaniil@ryabko.net\n\nJ\u00b4er\u00b4emie Mary\n\nSequeL-INRIA/LIFL-CNRS,\nUniversit\u00b4e de Lille, France\n\nJeremie.Mary@inria.fr\n\nAbstract\n\nWe show how binary classi\ufb01cation methods developed to work on i.i.d. data can\nbe used for solving statistical problems that are seemingly unrelated to classi\ufb01-\ncation and concern highly-dependent time series. Speci\ufb01cally, the problems of\ntime-series clustering, homogeneity testing and the three-sample problem are ad-\ndressed. The algorithms that we construct for solving these problems are based\non a new metric between time-series distributions, which can be evaluated using\nbinary classi\ufb01cation methods. Universal consistency of the proposed algorithms\nis proven under most general assumptions. 
The theoretical results are illustrated with experiments on synthetic and real-world data.

1 Introduction

Binary classification is one of the most well-understood problems of machine learning and statistics: a wealth of efficient classification algorithms has been developed and applied to a wide range of applications. Perhaps one of the reasons for this is that binary classification is conceptually one of the simplest statistical learning problems. It is thus natural to try to use it as a building block for solving other, more complex, newer or just different problems; in other words, one can try to obtain efficient algorithms for different learning problems by reducing them to binary classification. This approach has been applied to many different problems, starting with multi-class classification, and including regression and ranking [3, 16], to give just a few examples. However, all of these problems are formulated in terms of independent and identically distributed (i.i.d.) samples. This is also the assumption underlying the theoretical analysis of most classification algorithms.
In this work we consider learning problems that concern time-series data, for which independence assumptions do not hold. The series can exhibit arbitrary long-range dependence, and different time-series samples may be interdependent as well. Moreover, the learning problems that we consider — the three-sample problem, time-series clustering, and homogeneity testing — at first glance seem completely unrelated to classification.
We show how the considered problems can be reduced to binary classification. The results include asymptotically consistent algorithms, as well as finite-sample analysis. To establish the consistency of the suggested methods, for clustering and the three-sample problem the only assumption that we make on the data is that the distributions generating the samples are stationary ergodic; this is one of the weakest assumptions used in statistics. For homogeneity testing we have to make some mixing assumptions in order to obtain consistency results (this is indeed unavoidable [22]). Mixing conditions are also used to obtain finite-sample performance guarantees for the first two problems.
The proposed approach is based on a new distance between time-series distributions (that is, between probability distributions on the space of infinite sequences), which we call the telescope distance. This distance can be evaluated using binary classification methods, and its finite-sample estimates are shown to be asymptotically consistent. Three main building blocks are used to construct the telescope distance. The first one is a distance on finite-dimensional marginal distributions. The distance we use for this is the following: dH(P, Q) := sup_{h∈H} |E_P h − E_Q h|, where P, Q are distributions and H is a set of functions. This distance can be estimated using binary classification methods, and thus can be used to reduce various statistical problems to the classification problem. This distance was previously applied to such statistical problems as homogeneity testing and change-point estimation [14]. However, these applications so far have only concerned i.i.d. data, whereas we want to work with highly-dependent time series. Thus, the second building block is the recent results of [1, 2], which show that empirical estimates of dH are consistent (under certain conditions on H) for arbitrary stationary ergodic distributions.
In Section 3 we introduce and discuss the telescope distance. Section 4 explains how this distance can be calculated using binary classification methods. Sections 5 and 6 are devoted to the three-sample problem and clustering, respectively. In Section 7, under some mixing conditions, we address the problems of homogeneity testing, clustering with unknown k, and finite-sample performance guarantees. Section 8 presents experimental evaluation. Some proofs are deferred to the supplementary material.

2 Notation and definitions

Let (X, F_X) be a measurable space (the domain). Time-series (or process) distributions are probability measures on the space (X^N, F_N) of one-way infinite sequences (where F_N is the Borel sigma-algebra of X^N). We use the abbreviation X_{1..k} for X_1, ..., X_k. All sets and functions introduced below (in particular, the sets Hk and their elements) are assumed measurable.
A distribution ρ is stationary if ρ(X_{1..k} ∈ A) = ρ(X_{n+1..n+k} ∈ A) for all A ∈ F_{X^k}, k, n ∈ N (with F_{X^k} being the sigma-algebra of X^k). A stationary distribution is called (stationary) ergodic if

lim_{n→∞} (1/n) Σ_{i=1..n−k+1} I_{X_{i..i+k−1} ∈ A} = ρ(A) ρ-a.s.

for every A ∈ F_{X^k}, k ∈ N. (This definition, which is more suited for the purposes of this work, is equivalent to the usual one expressed in terms of invariant sets, see e.g. [8].)

3 A distance between time-series distributions

We start with a distance between distributions on X, and then we will extend it to distributions on X^∞.
For two probability distributions P and Q on (X, F) and a set H of measurable functions on X, one can define the distance

dH(P, Q) := sup_{h∈H} |E_P h − E_Q h|.

Special cases of this distance are the Kolmogorov-Smirnov [15], Kantorovich-Rubinstein [11] and Fortet-Mourier [7] metrics; the general case has been studied since at least [26].
We will be interested in the cases where dH(P, Q) = 0 implies P = Q. Note that in this case dH is a metric (the rest of the properties are easy to see). For reasons that will become apparent shortly (see the Remark below), we will be mainly interested in sets H that consist of indicator functions. In this case we can identify each f ∈ H with the set {x : f(x) = 1} ⊂ X and (by a slight abuse of notation) write dH(P, Q) := sup_{h∈H} |P(h) − Q(h)|. It is easy to check that in this case dH is a metric if and only if H generates F. The latter property is often easy to verify directly. First of all, it trivially holds for the case where H is the set of halfspaces in a Euclidean X. It is also easy to check that it holds if H is the set of halfspaces in the feature space of most commonly used kernels (provided the feature space is of the same or higher dimension than the input space), such as polynomial and Gaussian kernels.
Based on dH we can construct a distance between time-series probability distributions. For two time-series distributions ρ1, ρ2 we take the dH between the k-dimensional marginal distributions of ρ1 and ρ2 for each k ∈ N, and sum them all up with decreasing weights.

Definition 1 (telescope distance D). For two time-series distributions ρ1 and ρ2 on the space (X^∞, F_∞) and a sequence of sets of functions H = (H1, H2, ...) define the telescope distance

DH(ρ1, ρ2) := Σ_{k=1}^∞ wk sup_{h∈Hk} |E_{ρ1} h(X_1, ..., X_k) − E_{ρ2} h(Y_1, ..., Y_k)|,    (1)

where wk, k ∈ N, is a sequence of positive summable real weights (e.g. wk = 1/k²).

Lemma 1. DH is a metric if and only if dHk is a metric for every k ∈ N.

Proof. The statement follows from the fact that two process distributions are the same if and only if all their finite-dimensional marginals coincide.

Definition 2 (empirical telescope distance D̂). For a pair of samples X_{1..n} and Y_{1..m} define the empirical telescope distance as

D̂H(X_{1..n}, Y_{1..m}) := Σ_{k=1}^{min{m,n}} wk sup_{h∈Hk} | (1/(n−k+1)) Σ_{i=1}^{n−k+1} h(X_{i..i+k−1}) − (1/(m−k+1)) Σ_{i=1}^{m−k+1} h(Y_{i..i+k−1}) |.    (2)

All the methods presented in this work are based on the empirical telescope distance. The key fact is that it is an asymptotically consistent estimate of the telescope distance, that is, the latter can be consistently estimated based on sampling.

Theorem 1. Let H = (H1, H2, ...), Hk ⊂ X^k, k ∈ N, be a sequence of separable sets of indicator functions of finite VC dimension such that Hk generates F_{X^k}. Then, for every pair of stationary ergodic time-series distributions ρX and ρY generating samples X_{1..n} and Y_{1..m} we have

lim_{n,m→∞} D̂H(X_{1..n}, Y_{1..m}) = DH(ρX, ρY).    (3)

The proof is deferred to the supplementary material. Note that D̂H is a biased estimate of DH and, unlike in the i.i.d. case, the bias may depend on the distributions; however, the bias is o(n).
Remark. The condition that the sets Hk are sets of indicator functions of finite VC dimension comes from [2], where it is shown that for any stationary ergodic distribution ρ, under these conditions, sup_{h∈Hk} (1/(n−k+1)) Σ_{i=1}^{n−k+1} h(X_{i..i+k−1}) is an asymptotically consistent estimate of E_ρ h(X_1, ..., X_k).
This fact implies that dH can be consistently estimated, from which the theorem is derived.

4 Calculating D̂H using binary classification methods

The methods for solving various statistical problems that we suggest are all based on D̂H. The main appeal of this approach is that D̂H can be calculated using binary classification methods. Here we explain how to do it.
The definition (2) of D̂H involves calculating l summands (where l := min{n, m}), that is,

sup_{h∈Hk} | (1/(n−k+1)) Σ_{i=1}^{n−k+1} h(X_{i..i+k−1}) − (1/(m−k+1)) Σ_{i=1}^{m−k+1} h(Y_{i..i+k−1}) |    (4)

for each k = 1..l. Assuming that the h ∈ Hk are indicator functions, calculating each of the summands amounts to solving the following k-dimensional binary classification problem. Consider X_{i..i+k−1}, i = 1..n−k+1, as class-1 examples and Y_{i..i+k−1}, i = 1..m−k+1, as class-0 examples. The supremum (4) is attained on the h ∈ Hk that minimizes the empirical risk, with examples weighted with respect to the sample size. Indeed, define the weighted empirical risk of any h ∈ Hk as

(1/(n−k+1)) Σ_{i=1}^{n−k+1} (1 − h(X_{i..i+k−1})) + (1/(m−k+1)) Σ_{i=1}^{m−k+1} h(Y_{i..i+k−1}),

which is obviously minimized by any h ∈ Hk that attains (4).
Thus, as long as we have a way to find an h ∈ Hk that minimizes the empirical risk, we have a consistent estimate of DH(ρX, ρY), under the mild conditions on H required by Theorem 1.
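The reduction just described is easy to state in code. The following is a minimal sketch, not the implementation used in the paper: instead of an SVM it uses a toy finite class of threshold indicators (a hypothetical choice, used only to keep the example self-contained), closed under complement so that one minus the minimal weighted empirical risk equals the supremum in (4); in practice any empirical-risk-minimizing classifier can be plugged in.

```python
# Sketch of the empirical telescope distance via the classification reduction.
# The hypothesis class here (thresholds on the first coordinate of a window,
# plus their complements) is illustrative only.

def windows(sample, k):
    # all contiguous windows of length k
    return [tuple(sample[i:i + k]) for i in range(len(sample) - k + 1)]

def threshold_class(k, grid=10):
    # indicators h(w) = 1[w[0] <= t] for t on a grid, and their complements
    hs = [lambda w, t=j / grid: 1.0 if w[0] <= t else 0.0 for j in range(grid + 1)]
    return hs + [lambda w, h=h: 1.0 - h(w) for h in hs]

def summand(x_win, y_win, hypotheses):
    # sup_h (P(h) - Q(h)) = 1 - min_h weighted empirical risk (Section 4);
    # since the class is closed under complement, this equals sup_h |P(h) - Q(h)|
    def risk(h):
        return (sum(1.0 - h(w) for w in x_win) / len(x_win)
                + sum(h(w) for w in y_win) / len(y_win))
    return 1.0 - min(risk(h) for h in hypotheses)

def telescope_estimate(x, y, max_k=5):
    # weights w_k = 1/k^2 as suggested after Definition 1; the number of
    # summands is truncated (cf. the gamma_l = log l remark below)
    d = 0.0
    for k in range(1, min(len(x), len(y), max_k) + 1):
        d += summand(windows(x, k), windows(y, k), threshold_class(k)) / k ** 2
    return d
```

On two identical samples the estimate is exactly 0, while two samples concentrated on clearly separated values give a large value (close to the weight sum).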
Since the dimension of the resulting classification problems grows with the length of the sequences, one should prefer methods that work well in high dimensions, such as soft-margin SVMs [6].
A particularly remarkable feature is that the choice of Hk is much easier for the problems that we consider than in the binary classification problem. Specifically, if (for some fixed k) the classifier that achieves the minimal (Bayes) error for the classification problem is not in Hk, then obviously the error of an empirical risk minimizer will not tend to zero, no matter how much data we have. In contrast, all we need in order to achieve asymptotically zero error in estimating D̂ (and therefore, in the learning problems considered below) is that the sets Hk generate F_{X^k} and have a finite VC dimension (for each k). This is the case already for the set of hyperplanes in R^k! Thus, while the choice of Hk (or, say, of the kernel to use in an SVM) is still important from the practical point of view, it is almost irrelevant for the theoretical consistency results. Thus, we have the following.

Claim 1. The approximation error |DH(P, Q) − D̂H(X, Y)|, and thus the error of the algorithms below, can be much smaller than the error of the classification algorithms used to calculate D̂H(X, Y).

Finally, we remark that while in (2) the number of summands is l, it can be replaced with any γ_l such that γ_l → ∞, without affecting any of the asymptotic consistency results. A practically viable choice is γ_l = log l; in fact, there is no reason to choose a faster-growing γ_l, since the estimates for the higher-order summands will not have enough data to converge. This is also the value we use in the experiments.

5 The three-sample problem

We start with a conceptually simple problem known in statistics as the three-sample problem (sometimes also called time-series classification).
We are given three samples X = (X_1, ..., X_n), Y = (Y_1, ..., Y_m) and Z = (Z_1, ..., Z_l). It is known that X and Y were generated by different time-series distributions, whereas Z was generated by the same distribution as either X or Y. It is required to find out which one is the case. Both distributions are assumed to be stationary ergodic, but no further assumptions are made about them (no independence, mixing or memory assumptions). The three-sample problem for dependent time series has been addressed in [9] for Markov processes and in [23] for stationary ergodic time series. The latter work uses an approach based on the distributional distance.
Indeed, to solve this problem it suffices to have consistent estimates of some distance between time-series distributions. Thus, we can use the telescope distance. The following statement is a simple corollary of Theorem 1.

Theorem 2. Let the samples X = (X_1, ..., X_n), Y = (Y_1, ..., Y_m) and Z = (Z_1, ..., Z_l) be generated by stationary ergodic distributions ρX, ρY and ρZ, with ρX ≠ ρY and either (i) ρZ = ρX or (ii) ρZ = ρY. Assume that the sets Hk ⊂ X^k, k ∈ N, are separable sets of indicator functions of finite VC dimension such that Hk generates F_{X^k}. A test that declares (i) if D̂H(Z, X) ≤ D̂H(Z, Y) and (ii) otherwise makes only finitely many errors with probability 1 as n, m, l → ∞.

It is straightforward to extend this theorem to more than two classes; in other words, instead of X and Y one can have an arbitrary number of samples from different stationary ergodic distributions.

6 Clustering time series

We are given N samples X^1 = (X^1_1, ..., X^1_{n_1}), ..., X^N = (X^N_1, ..., X^N_{n_N}), generated by k different stationary ergodic time-series distributions ρ_1, ..., ρ_k. The number k is known, but the distributions are not.
It is required to group the N samples into k groups (clusters), that is, to output a partitioning of {X^1, ..., X^N} into k sets. While there may be many different approaches to define what a good clustering is (and, in general, deciding what a good clustering is is a difficult problem), for the problem of clustering time-series samples there is a natural choice, proposed in [21]: those samples should be put together that were generated by the same distribution. Thus, define the target clustering as the partitioning in which those and only those samples that were generated by the same distribution are placed in the same cluster. A clustering algorithm is called asymptotically consistent if with probability 1 there is an n′ such that the algorithm produces the target clustering whenever max_{i=1..N} n_i ≥ n′.
Again, to solve this problem it is enough to have a metric between time-series distributions that can be consistently estimated. Our approach here is based on the telescope distance, and thus we use D̂.
The clustering problem is relatively simple if the target clustering has what is called the strict separation property [4]: every two points in the same target cluster are closer to each other than to any point from a different target cluster. The following statement is an easy corollary of Theorem 1.

Theorem 3. Assume that the sets Hk ⊂ X^k, k ∈ N, are separable sets of indicator functions of finite VC dimension, such that Hk generates F_{X^k}. If the distributions ρ_1, ..., ρ_k generating the samples X^1 = (X^1_1, ..., X^1_{n_1}), ..., X^N = (X^N_1, ..., X^N_{n_N}) are stationary ergodic, then with probability 1 from some n := max_{i=1..N} n_i on the target clustering has the strict separation property with respect to D̂H.

With the strict separation property at hand, it is easy to find asymptotically consistent algorithms. We will give some simple examples, but the theorem below can be extended to many other distance-based clustering algorithms.
The average linkage algorithm works as follows. The distance between clusters is defined as the average distance between points in these clusters. First, put each point into a separate cluster. Then, merge the two closest clusters; repeat the last step until the total number of clusters is k. The farthest point clustering works as follows. Assign c_1 := X^1 to the first cluster. For i = 2..k, find the point X^j, j ∈ {1..N}, that maximizes the distance min_{t=1..i−1} D̂H(X^j, c_t) (to the points already assigned to clusters) and assign c_i := X^j to the cluster i. Then assign each of the remaining points to the nearest cluster. The following statement is a corollary of Theorem 3.

Theorem 4. Under the conditions of Theorem 3, average linkage and farthest point clusterings are asymptotically consistent.

Note that we do not require the samples to be independent; the joint distributions of the samples may be completely arbitrary, as long as the marginal distribution of each sample is stationary ergodic. These results can be extended to the online setting in the spirit of [13].

7 Speed of convergence

The results established so far are asymptotic out of necessity: they are established under the assumption that the distributions involved are stationary ergodic, which is too general to allow for any meaningful finite-time performance guarantees.
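For concreteness, the farthest point clustering of Section 6 can be sketched as follows; this is an illustrative sketch rather than the implementation used in the paper, and `dist` stands for any pairwise distance between samples (D̂H in our setting).

```python
# Sketch of farthest point clustering: pick the first sample as the first
# center, then repeatedly pick the sample farthest from the centers chosen so
# far, and finally assign every sample to its nearest center.

def farthest_point_clustering(samples, k, dist):
    centers = [0]  # c_1 := the first sample
    for _ in range(2, k + 1):
        j = max(range(len(samples)),
                key=lambda j: min(dist(samples[j], samples[c]) for c in centers))
        centers.append(j)
    # label each sample by the index of its nearest center
    return [min(range(k), key=lambda t: dist(samples[i], samples[centers[t]]))
            for i in range(len(samples))]
```

With a stand-in distance such as the absolute difference of sample means, four samples concentrated around two well-separated levels are split into the two expected clusters.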
Moreover, some statistical problems, such as homogeneity testing or clustering when the number of clusters is unknown, are provably impossible to solve under this assumption [22].
While it is interesting to be able to establish consistency results under such general assumptions, it is also interesting to see what results can be obtained under stronger assumptions. Moreover, since it is usually not known in advance whether the data at hand satisfies given assumptions or not, it appears important to have methods that have both asymptotic consistency in the general setting and finite-time performance guarantees under stronger assumptions.
In this section we will look at the speed of convergence of D̂ under certain mixing conditions, and use it to construct solutions for the problems of homogeneity testing and clustering with an unknown number of clusters, as well as to establish finite-time performance guarantees for the methods presented in the previous sections.
A stationary distribution on the space of one-way infinite sequences (X^N, F_N) can be uniquely extended to a stationary distribution on the space of two-way infinite sequences (X^Z, F_Z) of the form ..., X_{−1}, X_0, X_1, ....

Definition 3 (β-mixing coefficients). For a process distribution ρ define the mixing coefficients

β(ρ, k) := sup_{A ∈ σ(X_{−∞..0}), B ∈ σ(X_{k..∞})} |ρ(A ∩ B) − ρ(A)ρ(B)|,

where σ(..) denotes the sigma-algebra of the random variables in brackets.
When β(ρ, k) → 0 the process ρ is called absolutely regular; this condition is much stronger than ergodicity, but is much weaker than the i.i.d. assumption.

7.1 Speed of convergence of D̂

Assume that a sample X_{1..n} is generated by a distribution ρ that is uniformly β-mixing with coefficients β(ρ, k). Assume further that Hk is a set of indicator functions with a finite VC dimension d_k, for each k ∈ N. Define

q_n(ρ, Hk, ε) := ρ( sup_{h∈Hk} | (1/(n−k+1)) Σ_{i=1}^{n−k+1} h(X_{i..i+k−1}) − E_ρ h(X_{1..k}) | > ε ).

The general tool that we use to obtain performance guarantees in this section is the following bound that can be obtained from the results of [12]:

q_n(ρ, Hk, ε) ≤ n β(ρ, t_n − k) + 8 t_n^{d_k+1} e^{−l_n ε²/8},    (5)

where t_n are any integers in 1..n and l_n = n/t_n. The parameters t_n should be set according to the values of β in order to optimize the bound.
One can use similar bounds for classes of finite Pollard dimension [18] or more general bounds expressed in terms of covering numbers, such as those given in [12]. Here we consider classes of finite VC dimension only for the ease of the exposition and for the sake of continuity with the previous section (where it was necessary).
Furthermore, for the rest of this section we assume geometric β-mixing distributions, that is, β(ρ, t) ≤ γ^t for some γ < 1. Letting l_n = t_n = √n, the bound (5) becomes

q_n(ρ, Hk, ε) ≤ n γ^{√n − k} + 8 n^{(d_k+1)/2} e^{−√n ε²/8}.    (6)

Lemma 2. Let two samples X_{1..n} and Y_{1..m} be generated by stationary distributions ρX and ρY whose β-mixing coefficients satisfy β(ρ·, t) ≤ γ^t for some γ < 1. Let Hk, k ∈ N, be some sets of indicator functions on X^k whose VC dimension d_k is finite and non-decreasing with k. Then

P(|D̂H(X_{1..n}, Y_{1..m}) − DH(ρX, ρY)| > ε) ≤ 2∆(ε/4, n′),    (7)

where n′ := min{n, m}, the probability is with respect to ρX × ρY, and

∆(ε, n) := −log ε (n γ^{√n + log ε} + 8 n^{(d_{−log ε}+1)/2} e^{−√n ε²/8}).    (8)

7.2 Homogeneity testing

Given two samples X_{1..n} and Y_{1..m} generated by distributions ρX and ρY respectively, the problem of homogeneity testing (or the two-sample problem) consists in deciding whether ρX = ρY. A test is called (asymptotically) consistent if its probability of error goes to zero as n′ := min{m, n} goes to infinity. In general, for stationary ergodic time-series distributions, there is no asymptotically consistent test for homogeneity [22], so stronger assumptions are in order.
Homogeneity testing is one of the classical problems of mathematical statistics, and one of the most studied ones. A vast literature exists on homogeneity testing for i.i.d. data, and for dependent processes as well. We do not attempt to survey this literature here. Our contribution to this line of research is to show that this problem can be reduced (via the telescope distance) to binary classification, in the case of strongly dependent processes satisfying some mixing conditions.
It is easy to see that under the mixing conditions of Lemma 2 a consistent test for homogeneity exists, and finite-sample performance guarantees can be obtained. It is enough to find a sequence ε_n → 0 such that ∆(ε_n, n) → 0 (see (8)). Then the test can be constructed as follows: say that the two sequences X_{1..n} and Y_{1..m} were generated by the same distribution if D̂H(X_{1..n}, Y_{1..m}) < ε_{min{n,m}}; otherwise say that they were generated by different distributions.
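As a sketch, the described test is a simple threshold rule. The code below is illustrative (the distance is passed in as a function, and the default threshold follows the ε_n = n^{−1/8} choice discussed below for halfspaces); it is not tied to any particular implementation of D̂H.

```python
# Threshold test for homogeneity (Section 7.2): declare "same distribution"
# iff the empirical telescope distance falls below eps_n, with eps_n -> 0.

def homogeneity_test(x, y, telescope_dist, eps=None):
    n = min(len(x), len(y))
    if eps is None:
        eps = n ** (-1 / 8)  # e.g. choice for halfspaces, where d_k = k + 1
    return telescope_dist(x, y) < eps  # True means "same distribution"
```

For example, with samples of length 256 the default threshold is 256^(−1/8) = 0.5, so a distance estimate of 0 is accepted as "same" and an estimate of 1 is rejected.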
The following statement is an immediate consequence of Lemma 2.

Theorem 5. Under the conditions of Lemma 2, the probability of Type I error (the distributions are the same but the test says they are different) of the described test is upper-bounded by 4∆(ε/8, n′). The probability of Type II error (the distributions are different but the test says they are the same) is upper-bounded by 4∆(δ − ε/8, n′), where δ := (1/2) DH(ρX, ρY).

The optimal choice of ε_n may depend on the speed at which d_k (the VC dimension of Hk) increases; however, for most natural cases (recall that the Hk are also parameters of the algorithm) this growth is polynomial, so the main term to control is e^{−√n ε²/8}. For example, if Hk is the set of halfspaces in X^k = R^k, then d_k = k + 1 and one can choose ε_n := n^{−1/8}. The resulting probability of Type I error decreases as exp(−n^{1/4}).

7.3 Clustering with a known or unknown number of clusters

If the distributions generating the samples satisfy certain mixing conditions, then we can augment Theorems 3 and 4 with finite-sample performance guarantees.

Theorem 6. Let the distributions ρ_1, ..., ρ_k generating the samples X^1 = (X^1_1, ..., X^1_{n_1}), ..., X^N = (X^N_1, ..., X^N_{n_N}) satisfy the conditions of Lemma 2. Define δ := min_{i,j=1..k, i≠j} DH(ρ_i, ρ_j) and n := min_{i=1..N} n_i. Then with probability at least

1 − N(N − 1)∆(δ/4, n)/2

the target clustering of the samples has the strict separation property. In this case single linkage and farthest point algorithms output the target clustering.

Proof.
Note that a sufficient condition for the strict separation property to hold is that for every one of the N(N − 1)/2 pairs of samples the estimate D̂H(X^i, X^j), i, j = 1..N, is within δ/4 of the DH distance between the corresponding distributions. It remains to apply Lemma 2 to obtain the first statement, and the second statement is obvious (cf. Theorem 4).

As with homogeneity testing, while in the general case of stationary ergodic distributions it is impossible to have a consistent clustering algorithm when the number of clusters k is unknown, the situation changes if the distributions satisfy certain mixing conditions. In this case a consistent clustering algorithm can be obtained as follows. Assign to the same cluster all samples that are at most ε_n-far from each other, where the threshold ε_n is selected the same way as for homogeneity testing: ε_n → 0 and ∆(ε_n, n) → 0. The optimal choice of this parameter depends on the choice of Hk through the speed of growth of the VC dimension d_k of these sets.

Theorem 7. Given N samples generated by k different stationary distributions ρ_i, i = 1..k (unknown k), all satisfying the conditions of Lemma 2, the probability of error (misclustering at least one sample) of the described algorithm is upper-bounded by

2N(N − 1) max{∆(ε/8, n), ∆(δ − ε/8, n)},

where δ := min_{i,j=1..k, i≠j} DH(ρ_i, ρ_j) and n = min_{i=1..N} n_i, with n_i, i = 1..N, being the lengths of the samples.

8 Experiments

For experimental evaluation we chose the problem of time-series clustering. Average-linkage clustering is used, with the telescope distance between samples calculated using an SVM, as described in Section 4. In all experiments, the SVM is used with a radial basis kernel, with the default parameters of libsvm [5].

8.1 Synthetic data

For the artificial setting we have chosen highly-dependent time-series distributions which have the same single-dimensional marginals and which cannot be well approximated by finite- or countable-state models. The distributions ρ(α), α ∈ (0, 1), are constructed as follows. Select r_0 ∈ [0, 1] uniformly at random; then, for each i = 1..n, obtain r_i by shifting r_{i−1} by α to the right and removing the integer part. The time series (X_1, X_2, ...) is then obtained from the r_i by drawing a point from the distribution N_1 if r_i < 0.5 and from N_2 otherwise. N_1 is a 3-dimensional Gaussian with mean 0 and covariance matrix (1/4)·Id; N_2 is the same but with mean 1. If α is irrational¹ then the distribution ρ(α) is stationary ergodic, but does not belong to any simpler natural distribution family [25]. The single-dimensional marginal is the same for all values of α. The latter two properties make all parametric and most non-parametric methods inapplicable to this problem.
In our experiments, we use two process distributions ρ(α_i), i ∈ {1, 2}, with α_1 = 0.31..., α_2 = 0.35.... The dependence of the error rate on the length of the time series is shown in Figure 1. One clustering experiment on sequences of length 1000 takes about 5 min. on a standard laptop.

8.2 Real data

To demonstrate the applicability of the proposed methods to realistic scenarios, we chose the brain-computer interface data from BCI competition III [17]. The dataset consists of (pre-processed) BCI recordings of mental imagery: a person is thinking about one of three subjects (left foot, right foot, a random letter).
Originally, each time series consisted of several consecutive sequences of different classes, and the problem was supervised: three time series for training and one for testing. We split each of the original time series into classes, and then used our clustering algorithm in a completely unsupervised setting. The original problem is 96-dimensional, but we used only the first 3 dimensions (using all 96 gives worse performance). The typical sequence length is 300. The performance is reported in Table 1, labeled TSSVM. All the computation for this experiment takes approximately 6 minutes on a standard laptop.

The following methods were used for comparison. First, we used dynamic time warping (DTW) [24], which is a popular baseline approach for time-series clustering. The other two methods in Table 1 are from [10]. The comparison is not fully relevant, since the results in [10] are for different settings: the method KCpA was used for change-point estimation (a different, but also unsupervised, setting), and SVM was used in a supervised setting. The latter is of particular interest since the classification method we use in the telescope distance is also an SVM, but our setting is unsupervised (clustering).

         s1    s2    s3
TSSVM    84%   81%   61%
DTW      46%   41%   36%
KCpA     79%   74%   61%
SVM      76%   69%   60%

Figure 1: Error of two-class clustering using TSSVM; 10 time series in each target cluster, averaged over 20 runs.

Table 1: Clustering accuracy in the BCI dataset: 3 subjects (columns), 4 methods (rows). Our method is TSSVM.

Acknowledgments.
This research was funded by the Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and FEDER (Contrat de Projets Etat Region CPER 2007-2013), ANR projects EXPLO-RA (ANR-08-COSI-004), Lampada (ANR-09-EMER-007) and CoAdapt, and by the European Community's FP7 Program under grant agreements n◦ 216886 (PASCAL2) and n◦ 270327 (CompLACS).

¹In the experiments, α is simulated by a long double with a long mantissa.

References

[1] T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38:1345–1367, 2010.

[2] T. M. Adams and A. B. Nobel. Uniform approximation of Vapnik-Chervonenkis classes. Bernoulli, 18(4):1310–1319, 2012.

[3] M.-F. Balcan, N. Bansal, A. Beygelzimer, D. Coppersmith, J. Langford, and G. Sorkin. Robust reductions from ranking to classification. In COLT'07, v. 4539 of LNCS, pages 604–619, 2007.

[4] M.-F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In STOC, pages 671–680. ACM, 2008.

[5] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[7] R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Sci. Ec. Norm. Super., III. Ser, 70(3):267–285, 1953.

[8] R. Gray. Probability, Random Processes, and Ergodic Properties. Springer Verlag, 1988.

[9] M. Gutman. Asymptotically optimal classification for multiple tests with empirically observed statistics.
IEEE Transactions on Information Theory, 35(2):402–408, 1989.

[10] Z. Harchaoui, F. Bach, and E. Moulines. Kernel change-point analysis. In NIPS, pages 609–616, 2008.

[11] L. V. Kantorovich and G. S. Rubinstein. On a function space in certain extremal problems. Dokl. Akad. Nauk USSR, 115(6):1058–1061, 1957.

[12] R. L. Karandikar and M. Vidyasagar. Rates of uniform convergence of empirical means with mixing processes. Statistics and Probability Letters, 58:297–307, 2002.

[13] A. Khaleghi, D. Ryabko, J. Mary, and P. Preux. Online clustering of processes. In AISTATS, JMLR W&CP 22, pages 601–609, 2012.

[14] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In VLDB (vol. 30), pages 180–191, 2004.

[15] A. N. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. G. Inst. Ital. Attuari, pages 83–91, 1933.

[16] John Langford, Roberto Oliveira, and Bianca Zadrozny. Predicting conditional quantiles via reduction to classification. In UAI, 2006.

[17] José del R. Millán. On the need for on-line learning in brain-computer interfaces. In Proc. of the Int. Joint Conf. on Neural Networks, 2004.

[18] D. Pollard. Convergence of Stochastic Processes. Springer, 1984.

[19] B. Ryabko. Prediction of random sequences and universal coding. Problems of Information Transmission, 24:87–96, 1988.

[20] B. Ryabko. Compression-based methods for nonparametric prediction and estimation of some characteristics of time series. IEEE Transactions on Information Theory, 55:4309–4315, 2009.

[21] D. Ryabko. Clustering processes. In Proc. ICML 2010, pages 919–926, Haifa, Israel, 2010.

[22] D. Ryabko. Discrimination between B-processes is impossible. Journal of Theoretical Probability, 23(2):565–575, 2010.

[23] D. Ryabko and B. Ryabko. Nonparametric statistical inference for ergodic processes.
IEEE Transactions on Information Theory, 56(3):1430–1435, 2010.

[24] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):43–49, 1978.

[25] P. Shields. The Ergodic Theory of Discrete Sample Paths. AMS Bookstore, 1996.

[26] V. M. Zolotarev. Metric distances in spaces of random variables and their distributions. Math. USSR-Sb, 30(3):373–401, 1976.