{"title": "Catching Change-points with Lasso", "book": "Advances in Neural Information Processing Systems", "page_first": 617, "page_last": 624, "abstract": "We propose a new approach for dealing with the estimation of the location of change-points in one-dimensional piecewise constant signals observed in white noise. Our approach consists in reframing this task in a variable selection context. We use a penalized least-squares criterion with a l1-type penalty for this purpose. We prove that, in an appropriate asymptotic framework, this method provides consistent estimators of the change-points. Then, we explain how to implement this method in practice by combining the LAR algorithm and a reduced version of the dynamic programming algorithm and we apply it to synthetic and real data.", "full_text": "Catching Change-points with Lasso\n\nZaid Harchaoui, C\u00b4eline L\u00b4evy-Leduc\nLTCI, TELECOM ParisTech and CNRS\n37/39 Rue Dareau, 75014 Paris, France\n{zharchao,levyledu}@enst.fr\n\nAbstract\n\nWe propose a new approach for dealing with the estimation of the location of\nchange-points in one-dimensional piecewise constant signals observed in white\nnoise. Our approach consists in reframing this task in a variable selection con-\ntext. We use a penalized least-squares criterion with a `1-type penalty for this\npurpose. We prove some theoretical results on the estimated change-points and\non the underlying piecewise constant estimated function. Then, we explain how\nto implement this method in practice by combining the LAR algorithm and a re-\nduced version of the dynamic programming algorithm and we apply it to synthetic\nand real data.\n\n1 Introduction\n\nChange-points detection tasks are pervasive in various \ufb01elds, ranging from audio [10] to EEG seg-\nmentation [5]. The goal is to partition a signal into several homogeneous segments of variable\ndurations, in which some quantity remains approximately constant over time. 
This issue was addressed in a large literature (see [20], [11]), where the problem was tackled both from an on-line (sequential) [1] and an off-line (retrospective) [5] point of view. Most off-line approaches rely on a Dynamic Programming algorithm (DP), which retrieves K change-points within n observations of a signal with a time complexity of O(Kn\u00b2) [11]. Such a cost deters practitioners from applying these methods to large datasets. Moreover, one often observes a sub-optimal behavior of the raw DP algorithm on real datasets.\n\nWe suggest here to slightly depart from this line of research, by focusing on a reformulation of change-point estimation in a variable selection framework. Estimating change-point locations off-line then turns into performing variable selection on dummy variables representing all possible change-point locations. This allows us to take advantage of the latest theoretical [23], [3] and practical [7] advances in regression with a Lasso penalty. Indeed, the Lasso provides a very efficient method for selecting potential change-point locations. This selection is then refined by using the DP algorithm to estimate the change-point locations.\n\nLet us outline the paper. In Section 2, we first describe our theoretical reformulation of off-line change-point estimation as regression with a Lasso penalty. Then, we show that the estimated magnitudes of jumps are close in mean, in a sense to be made precise, to the true magnitudes of jumps. We also give a non-asymptotic inequality upper-bounding the \u21132-loss between the true underlying piecewise constant function and the estimated one. We describe our algorithm in Section 3. In Section 4, we discuss related works. 
Finally, we provide experimental evidence of the relevance of our approach.\n\n2 Theoretical approach\n\n2.1 Framework\n\nWe describe, in this section, how off-line change-point estimation can be cast as a variable selection problem. Off-line estimation of change-point locations within a signal (Yt) consists in estimating the \u03c4*_k's in the following model:\n\nY_t = \u00b5*_k + \u03b5_t,  \u03c4*_{k-1} + 1 \u2264 t \u2264 \u03c4*_k,  1 \u2264 k \u2264 K*,  t = 1, . . . , n,  with \u03c4*_0 = 0,   (1)\n\nwhere the \u03b5_t are i.i.d. zero-mean random variables with finite variance. This problem can be reformulated as follows. Let us consider\n\nYn = Xn \u03b2n + \u03b5n,   (2)\n\nwhere Yn is an n \u00d7 1 vector of observations, Xn is an n \u00d7 n lower triangular matrix with nonzero elements equal to one, and \u03b5n = (\u03b5n_1, . . . , \u03b5n_n)' is a zero-mean random vector such that the \u03b5n_j's are i.i.d. with finite variance. As for \u03b2n, it is an n \u00d7 1 vector having all its components equal to zero except those corresponding to the change-point instants. The above multiple change-point estimation problem (1) can thus be tackled as a variable selection one:\n\nMinimize_\u03b2  ||Yn \u2212 Xn\u03b2||\u00b2_n  subject to  ||\u03b2||_1 \u2264 s,   (3)\n\nwhere ||u||_1 and ||u||_n are defined for a vector u = (u_1, . . . , u_n) \u2208 R^n by ||u||_1 = \u2211_{j=1}^{n} |u_j| and ||u||\u00b2_n = n^{-1} \u2211_{j=1}^{n} u\u00b2_j respectively. Indeed, the above formulation amounts to minimizing the following counterpart objective in model (1):\n\nMinimize_{\u00b5_1, . . . , \u00b5_n}  n^{-1} \u2211_{t=1}^{n} (Y_t \u2212 \u00b5_t)\u00b2  subject to  \u2211_{t=1}^{n-1} |\u00b5_{t+1} \u2212 \u00b5_t| \u2264 s,   (4)\n\nwhich consists in imposing an \u21131-constraint on the magnitude of jumps. 
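To make the reformulation concrete, here is a minimal numpy sketch (illustrative only, not code from the paper; n, the change-point positions and the jump values are arbitrary choices): multiplying the lower triangular all-ones matrix Xn by a sparse vector acts as a cumulative sum, so the non-zero coordinates of the vector are exactly the jumps of the piecewise constant mean, and the l1 constraint of (3) matches the total-variation constraint of (4).

```python
import numpy as np

n = 12
# Lower triangular design matrix with nonzero elements equal to one.
X = np.tril(np.ones((n, n)))

# Sparse coefficient vector: non-zero entries encode jump locations/sizes
# (positions 4 and 8 and the values are arbitrary choices for illustration).
beta = np.zeros(n)
beta[0] = 2.0   # initial level
beta[4] = 3.0   # jump of +3 at t = 4
beta[8] = -1.5  # jump of -1.5 at t = 8

mu = X @ beta   # piecewise constant mean of model (1)

# The signal is constant between change-points...
assert np.allclose(np.diff(mu)[np.diff(mu) != 0], [3.0, -1.5])
# ...and the l1 norm of beta (past the initial level) equals the total
# variation sum |mu_{t+1} - mu_t|, i.e. the constraint in (4).
assert np.isclose(np.abs(np.diff(mu)).sum(), np.abs(beta[1:]).sum())
```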
The underpinning insight is the sparsity-enforcing property of the \u21131-constraint, which is expected to yield a sparse vector whose non-zero components match those of \u03b2n and thus the change-point locations. This formulation is related to the popular least absolute shrinkage and selection operator (LASSO) in least-squares regression of [21], used for efficient variable selection.\n\nIn the next section, we provide two results supporting the use of formulation (3) for off-line multiple change-point estimation. We show that the estimates of jumps minimizing (3) are consistent in mean, and we provide a non-asymptotic upper bound for the \u21132-loss between the estimated piecewise constant function and the true underlying one. This inequality shows that, at a rate made precise below, the estimated piecewise constant function tends to the true piecewise constant function with a probability tending to one.\n\n2.2 Main results\n\nIn this section, we shall study the properties of the solutions of problem (3), defined by\n\n\u02c6\u03b2n(\u03bb) = Argmin_\u03b2 { ||Yn \u2212 Xn\u03b2||\u00b2_n + \u03bb||\u03b2||_1 }.   (5)\n\nLet us now introduce the notation sign: it maps a positive entry to 1, a negative entry to \u22121 and a null entry to 0. Let\n\nA = {k : \u03b2n_k \u2260 0}  and  \u00afA = {1, . . . , n} \\ A,   (6)\n\nand let the covariance matrix Cn be defined by\n\nCn = n^{-1} Xn' Xn.   (7)\n\nIn a general regression framework, [18] recall that, with probability tending to one, \u02c6\u03b2n(\u03bb) and \u03b2n have the same sign for a well-chosen \u03bb only if the following condition holds element-wise:\n\n|Cn_{\u00afA A} (Cn_{A A})^{-1} sign(\u03b2n_A)| < 1,   (8)\n\nwhere Cn_{I J} is the sub-matrix of Cn obtained by keeping the rows with index in the set I and the columns with index in J, and where the vector \u03b2n_A is defined by \u03b2n_A = (\u03b2n_k)_{k \u2208 A}. Condition (8) is not fulfilled in the change-point framework, implying that we cannot have a perfect estimation of the change-points, as is already known, see [13]. But, following [18] and [3], we can prove some consistency results, see Propositions 1 and 2 below.\n\nIn the following, we shall assume that the number of break points is equal to K*.\n\nThe following proposition ensures that, for a large enough value of n, the estimated change-point locations are close to the true change-points.\n\nProposition 1. Assume that the observations (Yn) are given by (2) and that the \u03b5n_j's are centered. If \u03bb = \u03bbn is such that \u03bbn \u221an \u2192 0 as n tends to infinity, then\n\n||E(\u02c6\u03b2n(\u03bbn)) \u2212 \u03b2n||_n \u2192 0.\n\nProof. We shall follow the proof of Theorem 1 in [18]. For this, we denote by \u03b2n(\u03bb) the estimator \u02c6\u03b2n(\u03bb) in the absence of noise, and by \u03b3n(\u03bb) the bias associated with the Lasso estimator: \u03b3n(\u03bb) = \u03b2n(\u03bb) \u2212 \u03b2n. For notational simplicity, we shall write \u03b3 instead of \u03b3n(\u03bb). 
Note that \u03b3 satisfies the following minimization: \u03b3 = Argmin_{\u03b6 \u2208 R^n} f(\u03b6), where\n\nf(\u03b6) = \u03b6' Cn \u03b6 + \u03bb \u2211_{k \u2208 A} |\u03b2n_k + \u03b6_k| + \u03bb \u2211_{k \u2208 \u00afA} |\u03b6_k|.\n\nSince f(\u03b3) \u2264 f(0), we get\n\n\u03b3' Cn \u03b3 + \u03bb \u2211_{k \u2208 A} |\u03b2n_k + \u03b3_k| + \u03bb \u2211_{k \u2208 \u00afA} |\u03b3_k| \u2264 \u03bb \u2211_{k \u2208 A} |\u03b2n_k|.\n\nSince |\u03b2n_k + \u03b3_k| \u2265 |\u03b2n_k| \u2212 |\u03b3_k|, we thus obtain, using the Cauchy-Schwarz inequality, the following upper bound:\n\n\u03b3' Cn \u03b3 \u2264 \u03bb \u2211_{k \u2208 A} |\u03b3_k| \u2264 \u03bb \u221aK* ( \u2211_{k=1}^{n} |\u03b3_k|\u00b2 )^{1/2}.\n\nUsing that \u03b3' Cn \u03b3 \u2265 n^{-1} \u2211_{k=1}^{n} |\u03b3_k|\u00b2, we obtain ||\u03b3||_n \u2264 \u03bb \u221a(nK*), which tends to zero when \u03bb = \u03bbn satisfies \u03bbn \u221an \u2192 0.\n\nThe following proposition ensures, thanks to a non-asymptotic result, that the estimated underlying piecewise constant function is close to the true piecewise constant function.\n\nProposition 2. Assume that the observations (Yn) are given by (2) and that the \u03b5n_j's are centered i.i.d. Gaussian random variables with variance \u03c3\u00b2 > 0. Assume also that the (\u03b2n_k)_{k \u2208 A} belong to (\u03b2min, \u03b2max), where \u03b2min > 0. For all n \u2265 1 and A > \u221a2, if \u03bbn = A\u03c3 \u221a(log n / n), then, with a probability larger than 1 \u2212 n^{1 \u2212 A\u00b2/2},\n\n||Xn(\u02c6\u03b2n(\u03bbn) \u2212 \u03b2n)||\u00b2_n \u2264 2A\u03c3\u03b2max K* \u221a(log n / n).\n\nProof. 
By definition of \u02c6\u03b2n(\u03bb) in (5) as a minimizer of a criterion, we have\n\n||Yn \u2212 Xn \u02c6\u03b2n(\u03bb)||\u00b2_n + \u03bb||\u02c6\u03b2n(\u03bb)||_1 \u2264 ||Yn \u2212 Xn \u03b2n||\u00b2_n + \u03bb||\u03b2n||_1.\n\nUsing (2), we get\n\n||Xn(\u03b2n \u2212 \u02c6\u03b2n(\u03bb))||\u00b2_n + (2/n) (\u03b2n \u2212 \u02c6\u03b2n(\u03bb))' Xn' \u03b5n + \u03bb \u2211_{j=1}^{n} |\u02c6\u03b2n_j(\u03bb)| \u2264 \u03bb \u2211_{j=1}^{n} |\u03b2n_j|.\n\nThus,\n\n||Xn(\u03b2n \u2212 \u02c6\u03b2n(\u03bb))||\u00b2_n \u2264 (2/n) (\u02c6\u03b2n(\u03bb) \u2212 \u03b2n)' Xn' \u03b5n + \u03bb \u2211_{j \u2208 A} (|\u03b2n_j| \u2212 |\u02c6\u03b2n_j(\u03bb)|) \u2212 \u03bb \u2211_{j \u2208 \u00afA} |\u02c6\u03b2n_j(\u03bb)|.\n\nObserve that\n\n(2/n) (\u02c6\u03b2n(\u03bb) \u2212 \u03b2n)' Xn' \u03b5n = 2 \u2211_{j=1}^{n} (\u02c6\u03b2n_j(\u03bb) \u2212 \u03b2n_j) ( n^{-1} \u2211_{i=j}^{n} \u03b5n_i ).\n\nLet us define the event E = \u22c2_{j=1}^{n} { n^{-1} |\u2211_{i=j}^{n} \u03b5n_i| \u2264 \u03bb }. Then, using the fact that the \u03b5n_i's are i.i.d. zero-mean Gaussian random variables, we obtain\n\nP(\u00afE) \u2264 \u2211_{j=1}^{n} P( n^{-1} |\u2211_{i=j}^{n} \u03b5n_i| > \u03bb ) \u2264 \u2211_{j=1}^{n} exp( \u2212n\u00b2\u03bb\u00b2 / (2\u03c3\u00b2(n \u2212 j + 1)) ).\n\nThus, if \u03bb = \u03bbn = A\u03c3 \u221a(log n / n), P(\u00afE) \u2264 n^{1 \u2212 A\u00b2/2}.\n\nOn the event E, that is with a probability larger than 1 \u2212 n^{1 \u2212 A\u00b2/2}, we get\n\n||Xn(\u03b2n \u2212 \u02c6\u03b2n(\u03bb))||\u00b2_n \u2264 \u03bbn \u2211_{j=1}^{n} |\u02c6\u03b2n_j(\u03bb) \u2212 \u03b2n_j| + \u03bbn \u2211_{j \u2208 A} (|\u03b2n_j| \u2212 |\u02c6\u03b2n_j(\u03bb)|) \u2212 \u03bbn \u2211_{j \u2208 \u00afA} |\u02c6\u03b2n_j(\u03bb)|.\n\nSince |\u02c6\u03b2n_j(\u03bb) \u2212 \u03b2n_j| + |\u03b2n_j| \u2212 |\u02c6\u03b2n_j(\u03bb)| \u2264 2|\u03b2n_j| for j \u2208 A, while the terms over \u00afA cancel, we thus obtain, with a probability larger than 1 \u2212 n^{1 \u2212 A\u00b2/2}, the following upper bound:\n\n||Xn(\u03b2n \u2212 \u02c6\u03b2n(\u03bb))||\u00b2_n \u2264 2\u03bbn \u2211_{j \u2208 A} |\u03b2n_j| = 2A\u03c3 \u221a(log n / n) \u2211_{j \u2208 A} |\u03b2n_j| \u2264 2A\u03c3\u03b2max K* \u221a(log n / n).\n\n3 Practical approach\n\nThe previous results need to be efficiently implemented to cope with finite datasets. Our algorithm, called Cachalot (CAtching CHAnge-points with LassO), can be split into the three steps described hereafter.\n\nEstimation with a Lasso penalty. We compute the first Kmax non-null coefficients \u02c6\u03b2_{\u03c41}, . . . , \u02c6\u03b2_{\u03c4_{Kmax}} on the regularization path of the LASSO problem (3). The LAR/LASSO algorithm, as described in [7], provides an efficient way to compute the entire regularization path of the LASSO problem. Since \u2211_j |\u03b2_j| \u2264 s is a sparsity-enforcing constraint, the set {j : \u02c6\u03b2_j \u2260 0} = {\u03c4_j} becomes larger as we run through the regularization path. We shall denote by S the set of the Kmax selected variables:\n\nS = {\u03c41, . . . , \u03c4_{Kmax}}.   (9)\n\nThe computational complexity of the Kmax-long regularization path of LASSO solutions is O(K\u00b3max + K\u00b2max n). Most of the time, the Lasso effectively catches the true change-points, but also irrelevant change-points in the vicinity of the true ones. Therefore, we propose to refine the set of change-points caught by the Lasso by performing a post-selection.\n\nReduced Dynamic Programming algorithm. One can consider several strategies to remove irrelevant change-points from those retrieved by the Lasso. For instance, since in applications one is usually only interested in change-point estimation up to a given accuracy, one could launch the Lasso on a subsample of the signal. Here, we suggest to perform the post-selection by using the standard Dynamic Programming algorithm (DP) thoroughly described in [11] (Chapter 12, p. 450), but on the reduced set S instead of {1, . . . , n}. This algorithm efficiently minimizes the following objective for each K in {1, . . . , Kmax}:\n\nJ(K) = Min_{\u03c41 < \u00b7\u00b7\u00b7 < \u03c4K, \u03c41, . . . , \u03c4K \u2208 S}  \u2211_{k=1}^{K} \u2211_{i=\u03c4_{k-1}+1}^{\u03c4_k} (Y_i \u2212 \u02c6\u00b5_k)\u00b2,   (10)\n\nS being defined in (9), and it outputs, for each K, the corresponding subset of change-points (\u02c6\u03c41, . . . , \u02c6\u03c4K). The DP algorithm has a computational complexity of O(Kmax n\u00b2) if we look for at most Kmax change-points within the signal. Here, our reduced DP (rDP) calculations scale as O(Kmax K\u00b2max), where Kmax is the maximum number of change-points/variables selected by the LAR/LASSO algorithm. Since typically Kmax \u226a n, our method thus reduces the computational burden associated with the classical change-point detection approach, which consists in running the DP algorithm over all the n observations.\n\nSelecting the number of change-points. The point is now to select the adequate number of change-points. 
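Before turning to that choice, the reduced DP step just described can be sketched as follows (an illustrative pure-numpy implementation of objective (10), not the authors' code; the function names and the 0-indexed split convention are ours): it computes J(K) and the best K change-points among the Lasso candidates S, for every K up to Kmax.

```python
import numpy as np

def seg_cost(csum, csum2, i, j):
    """SSE of Y[i:j] around its mean, in O(1) from cumulative sums."""
    n = j - i
    s = csum[j] - csum[i]
    s2 = csum2[j] - csum2[i]
    return s2 - s * s / n

def reduced_dp(Y, S, K_max):
    """Best K-subsets of candidate split points S (0 < s < n), K = 0..K_max.

    Returns a dict K -> (J(K), chosen split points). Splits are 0-indexed
    boundaries: a split at s means segments Y[..:s] and Y[s:..].
    """
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    csum = np.concatenate([[0.0], np.cumsum(Y)])
    csum2 = np.concatenate([[0.0], np.cumsum(Y ** 2)])
    B = sorted(S) + [n]          # candidate segment ends; n closes the signal
    m = len(B)
    INF = float("inf")
    # dp[k][j]: best cost of Y[0:B[j]] cut into k+1 segments at candidate ends
    dp = [[INF] * m for _ in range(K_max + 1)]
    arg = [[None] * m for _ in range(K_max + 1)]
    for j in range(m):
        dp[0][j] = seg_cost(csum, csum2, 0, B[j])
    for k in range(1, K_max + 1):
        for j in range(k, m):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + seg_cost(csum, csum2, B[i], B[j])
                if c < dp[k][j]:
                    dp[k][j], arg[k][j] = c, i
    out = {}
    for k in range(K_max + 1):
        if dp[k][m - 1] < INF:
            splits, j = [], m - 1     # backtrack the chosen split points
            for kk in range(k, 0, -1):
                j = arg[kk][j]
                splits.append(B[j])
            out[k] = (dp[k][m - 1], sorted(splits))
    return out
```

On a toy signal with true splits at 4 and 8 and candidates S = {3, 4, 8, 9}, `reduced_dp` recovers [4, 8] for K = 2 with zero residual, mirroring the role the rDP plays after the Lasso pre-selection.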
As n \u2192 \u221e, according to [15], the ratio \u03c1k = J(k + 1)/J(k) should show a different qualitative behavior when k < K* and when k \u2265 K*, K* being the true number of change-points. In particular, \u03c1k \u2265 Cn for k \u2265 K*, where Cn \u2192 1 as n \u2192 \u221e. Actually, we found out that Cn was close to 1, even in small-sample settings, for various experimental designs in terms of noise variance and true number of change-points. Hence, conciliating theoretical guidance in the large-sample setting and experimental findings in the fixed-sample setting, we suggest the following rule of thumb for selecting the number of change-points \u02c6K:\n\n\u02c6K = Min {k \u2265 1 : \u03c1k \u2265 1 \u2212 \u03bd},  where \u03c1k = J(k + 1)/J(k).\n\nCachalot Algorithm\n\nInput\n\n\u2022 Vector of observations Y \u2208 R^n\n\u2022 Upper bound Kmax on the number of change-points\n\u2022 Model selection threshold \u03bd\n\nProcessing\n\n1. Compute the first Kmax non-null coefficients (\u02c6\u03b2_{\u03c41}, . . . , \u02c6\u03b2_{\u03c4_{Kmax}}) on the regularization path with the LAR/LASSO algorithm.\n2. Launch the rDP algorithm on the set of potential change-points (\u03c41, . . . , \u03c4_{Kmax}).\n3. Select the smallest subset of the potential change-points (\u03c41, . . . , \u03c4_{Kmax}) returned by the rDP algorithm for which \u03c1k \u2265 1 \u2212 \u03bd.\n\nOutput Change-point location estimates \u02c6\u03c41, . . . , \u02c6\u03c4_\u02c6K.\n\nTo illustrate our algorithm, we consider observations (Yn) satisfying model (2) with (\u03b230, \u03b250, \u03b270, \u03b290) = (5, \u22123, 4, \u22122), the other \u03b2j being equal to zero, n = 100 and \u03b5n a Gaussian random vector with covariance matrix equal to Id, the n \u00d7 n identity matrix. The set of the first nine active variables caught by the Lasso along the regularization path, i.e. the set {k : \u02c6\u03b2k \u2260 0}, is given in this case by: S = {21, 23, 28, 29, 30, 50, 69, 70, 90}. The set S contains the true change-points but also irrelevant ones close to the true change-points. Moreover, the most significant variables do not necessarily appear at the beginning of the path. This supports the use of the reduced version of the DP algorithm hereafter. Table 1 gathers the J(K), K = 1, . . . , Kmax, and the corresponding (\u02c6\u03c41, . . . , \u02c6\u03c4K).\n\nTable 1: Toy example: the empirical risk J and the estimated change-points as a function of the possible number of change-points K\n\nK | J(K) | (\u02c6\u03c41, . . . , \u02c6\u03c4K)\n0 | 696.28 | \u2205\n1 | 249.24 | 30\n2 | 209.94 | (30, 70)\n3 | 146.29 | (30, 50, 69)\n4 | 120.21 | (30, 50, 70, 90)\n5 | 118.22 | (30, 50, 69, 70, 90)\n6 | 116.97 | (21, 30, 50, 69, 70, 90)\n7 | 116.66 | (21, 29, 30, 50, 69, 70, 90)\n8 | 116.65 | (21, 23, 29, 30, 50, 69, 70, 90)\n9 | 116.64 | (21, 23, 28, 29, 30, 50, 69, 70, 90)\n\nThe different values of the ratio \u03c1k for k = 0, . . . , 8 of the model selection procedure are given in Table 2. Here we took \u03bd = 0.05. We conclude, as expected, that \u02c6K = 4 and that the change-points are (30, 50, 70, 90), thanks to the results obtained in Table 1.\n\nTable 2: Toy example: the values of the ratio \u03c1k = J(k + 1)/J(k), k = 0, . . . , 8\n\nk  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8\n\u03c1k | 0.3580 | 0.8423 | 0.6968 | 0.8218 | 0.9834 | 0.9894 | 0.9974 | 0.9999 | 1.0000\n\n4 Discussion\n\nOff-line multiple change-point estimation has recently received much attention in theoretical works, both in a non-asymptotic and in an asymptotic setting, by [17] and [13] respectively. From a practical point of view, retrieving the set of change-point locations {\u03c4*_1, . . . , \u03c4*_K*} is challenging, since it is plagued by the curse of dimensionality. Indeed, all of the n observation times have to be considered as potential change-point instants. 
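The model selection rule can be replayed on the J(K) values of the toy example in Table 1 (a short sketch; only the rule of thumb with threshold 1 - nu is implemented, and the helper name select_K is ours):

```python
# Rule of thumb: K_hat = min{ k >= 1 : rho_k = J(k+1)/J(k) >= 1 - nu }.
J = [696.28, 249.24, 209.94, 146.29, 120.21,
     118.22, 116.97, 116.66, 116.65, 116.64]  # J(0), ..., J(9) from Table 1

def select_K(J, nu):
    rho = [J[k + 1] / J[k] for k in range(len(J) - 1)]  # rho_0, ..., rho_8
    return next(k for k in range(1, len(rho)) if rho[k] >= 1 - nu)

K_hat = select_K(J, 0.05)  # -> 4, the value found in the toy example
```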
Yet, a dynamic programming algorithm (DP), proposed by [9] and [2], makes it possible to explore all the configurations with a complexity of O(n\u00b3) in time. Selecting the number of change-points is then usually performed thanks to a Schwarz-like penalty \u03bbnK, where \u03bbn has to be calibrated on data [13], [12], or a penalty K(a + b log(n/K)) as in [17], [14], where a and b are data-driven as well. We should also mention that an abundant literature tackles both change-point estimation and model selection issues from a Bayesian point of view (see [20], [8] and references therein). All the approaches cited above rely on DP, or on variants of it in Bayesian settings, and hence yield a computational complexity of O(n\u00b3), which makes them inappropriate for very large-scale signal segmentation. Moreover, despite its theoretical optimality in a maximum likelihood framework, raw DP may sometimes perform poorly when applied to very noisy observations. Our alternative framework for multiple change-point estimation was previously mentioned in passing several times, e.g. in [16], [4], [19]. However, to the best of our knowledge, neither a successful practical implementation nor a theoretical grounding was given so far to support such an approach for change-point estimation. Let us also mention [22], where the Fused Lasso is applied in a similar yet different way to perform hot-spot detection. However, this approach includes an additional penalty, penalizing departures from the overall mean of the observations, and should thus rather be considered as an outlier detection method.\n\n5 Comparison with other methods\n\n5.1 Synthetic data\n\nWe propose to compare our algorithm with a recent method based on a penalized least-squares criterion studied by [12]. The main difficulty in such approaches is the choice of the constants appearing in the penalty. 
In [12], a very efficient approach to overcome this difficulty has been proposed: the choice of the constants is completely data-driven and has been implemented in a toolbox available online at http://www.math.u-psud.fr/~lavielle/programs/index.html.\n\nIn the following, we benchmark our algorithm (A) against the latter method (B). We shall use Recall and Precision as performance measures to analyze the two algorithms. More precisely, the Recall corresponds to the ratio of the true change-points retrieved by a method to those really present in the data. As for the Precision, it corresponds to the number of true change-points retrieved divided by the number of suggested change-points. We shall also estimate the probability of false alarm, corresponding to the number of suggested change-points which are not present in the signal divided by the number of true change-points.\n\nTo compute the precision and the recall of methods A and B, we ran Monte-Carlo experiments. More precisely, we sampled 30 configurations of change-points for each true number of change-points K* equal to 5, 10, 15 and 20, within a signal containing 500 observations. Change-points were at least 10 observations apart. We sampled 30 configurations of levels from a Gaussian distribution. We used the following setting for the noise: for each configuration of change-points and levels, we synthesized a Gaussian white noise such that its standard deviation is a multiple of the minimum magnitude of jump between two contiguous segments, i.e. \u03c3 = m Min_k (\u00b5*_{k+1} \u2212 \u00b5*_k), \u00b5*_k being the level of the kth segment. The number of noise replications was set to 10.\n\nAs shown in Tables 3, 4 and 5 below, our method A yields competitive results compared to method B, with 1 \u2212 \u03bd = 0.99 and Kmax = 50. Performances in recall are comparable, whereas method A provides better results than method B in terms of precision and false alarm rate.\n\nTable 3: Precision of methods A and B (each cell: A / B)\n\nMethod | K* = 5 | K* = 10 | K* = 15 | K* = 20\nm = 0.1 | 0.81\u00b10.15 / 0.71\u00b10.29 | 0.89\u00b10.08 / 0.8\u00b10.22 | 0.95\u00b10.05 / 0.86\u00b10.13 | 0.97\u00b10.03 / 0.91\u00b10.09\nm = 0.5 | 0.8\u00b10.16 / 0.73\u00b10.29 | 0.89\u00b10.08 / 0.8\u00b10.21 | 0.95\u00b10.05 / 0.86\u00b10.13 | 0.97\u00b10.03 / 0.92\u00b10.09\nm = 1.0 | 0.78\u00b10.17 / 0.71\u00b10.27 | 0.88\u00b10.09 / 0.78\u00b10.21 | 0.93\u00b10.06 / 0.85\u00b10.13 | 0.96\u00b10.04 / 0.9\u00b10.09\nm = 1.5 | 0.73\u00b10.19 / 0.66\u00b10.28 | 0.84\u00b10.1 / 0.79\u00b10.2 | 0.93\u00b10.06 / 0.84\u00b10.13 | 0.95\u00b10.04 / 0.9\u00b10.1\n\nTable 4: Recall of methods A and B (each cell: A / B)\n\nMethod | K* = 5 | K* = 10 | K* = 15 | K* = 20\nm = 0.1 | 0.99\u00b10.02 / 0.99\u00b10.02 | 1\u00b10 / 1\u00b10 | 0.99\u00b10 / 0.99\u00b10 | 0.99\u00b10 / 1\u00b10\nm = 0.5 | 0.98\u00b10.04 / 0.99\u00b10.03 | 0.99\u00b10.01 / 0.99\u00b10.01 | 0.99\u00b10.01 / 0.99\u00b10.01 | 0.99\u00b10.01 / 1\u00b10\nm = 1.0 | 0.95\u00b10.08 / 0.94\u00b10.08 | 0.96\u00b10.06 / 0.96\u00b10.05 | 0.97\u00b10.03 / 0.97\u00b10.04 | 0.97\u00b10.03 / 0.98\u00b10.02\nm = 1.5 | 0.85\u00b10.16 / 0.87\u00b10.15 | 0.92\u00b10.07 / 0.91\u00b10.09 | 0.94\u00b10.06 / 0.94\u00b10.06 | 0.95\u00b10.04 / 0.96\u00b10.04\n\nTable 5: False alarm rate of methods A and B (each cell: A / B)\n\nMethod | K* = 5 | K* = 10 | K* = 15 | K* = 20\nm = 0.1 | 0.13\u00b10.03 / 0.23\u00b10.2 | 0.24\u00b10.03 / 0.33\u00b10.19 | 0.34\u00b10.02 / 0.42\u00b10.13 | 0.44\u00b10.02 / 0.51\u00b10.12\nm = 0.5 | 0.13\u00b10.03 / 0.22\u00b10.2 | 0.23\u00b10.03 / 0.32\u00b10.18 | 0.33\u00b10.02 / 0.41\u00b10.13 | 0.44\u00b10.02 / 0.5\u00b10.11\nm = 1.0 | 0.13\u00b10.03 / 0.21\u00b10.18 | 0.23\u00b10.03 / 0.32\u00b10.18 | 0.33\u00b10.02 / 0.4\u00b10.13 | 0.43\u00b10.03 / 0.5\u00b10.12\nm = 1.5 | 0.13\u00b10.03 / 0.21\u00b10.2 | 0.23\u00b10.03 / 0.29\u00b10.16 | 0.31\u00b10.03 / 0.4\u00b10.15 | 0.42\u00b10.03 / 0.48\u00b10.11\n\n5.2 Real data\n\nIn this section, we apply the method previously described to real data which have already been analyzed by Bayesian methods: the well-log data, which are described in [20] and [6] and displayed in Figure 1. They consist in nuclear magnetic response measurements expected to carry information about rock structure, and especially its stratification.\n\nOne distinctive feature of these data is that they typically contain a non-negligible amount of outliers. The multiple change-point estimation method should then either be used after a data cleaning step (median filtering [6]), or explicitly make a heavy-tailed noise distribution assumption. We restricted ourselves to a median filtering pre-processing. The results given by our method applied to the well-log data processed with a median filter are displayed in Figure 1, for Kmax = 200 and 1 \u2212 \u03bd = 0.99. The vertical lines locate the change-points. We can note that they are close to those found out by [6] (P. 
206) who used Bayesian techniques to perform change-point detection.\n\nFigure 1: Left: raw well-log data. Right: change-point locations obtained with our method on the well-log data processed with a median filter.\n\n6 Conclusion and prospects\n\nWe proposed here to cast multiple change-point estimation as a variable selection problem. A least-squares criterion with a Lasso penalty yields an efficient primary estimation of the change-point locations. These change-point location estimates are then further refined thanks to a reduced dynamic programming algorithm. We obtained competitive performances on both artificial and real data, in terms of precision, recall and false alarm. Thus, Cachalot is a computationally efficient multiple change-point estimation method, paving the way for processing large datasets.\n\nReferences\n\n[1] M. Basseville and N. Nikiforov. The detection of abrupt changes. Information and System Sciences Series. Prentice-Hall, 1993.\n[2] R. Bellman. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6), 1961.\n[3] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Preprint, 2007.\n[4] L. Boysen, A. Kempe, A. Munk, V. Liebscher, and O. Wittich. Consistencies and rates of convergence of jump penalized least squares estimators. Annals of Statistics, in revision.\n[5] B. Brodsky and B. Darkhovsky. Non-parametric statistical diagnosis: problems and methods. Kluwer Academic Publishers, 2000.\n[6] O. Capp\u00e9, E. Moulines, and T. Ryden. 
Inference in Hidden Markov Models (Springer Series in Statistics). Springer-Verlag New York, Inc., 2005.\n[7] B. Efron, T. Hastie, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407\u2013499, 2004.\n[8] P. Fearnhead. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16:203\u2013213, 2006.\n[9] W. D. Fisher. On grouping for maximum homogeneity. Journal of the American Statistical Association, 53:789\u2013798, 1958.\n[10] O. Gillet, S. Essid, and G. Richard. On the correlation of automatic audio and visual segmentation of music videos. IEEE Transactions on Circuits and Systems for Video Technology, 2007.\n[11] S. M. Kay. Fundamentals of statistical signal processing: detection theory. Prentice-Hall, Inc., 1993.\n[12] M. Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85(8):1501\u20131510, 2005.\n[13] M. Lavielle and E. Moulines. Least-squares estimation of an unknown number of shifts in a time series. Journal of Time Series Analysis, 21(1):33\u201359, 2000.\n[14] E. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85(4):717\u2013736, 2005.\n[15] C.-B. L. Lee. Estimating the number of change-points in a sequence of independent random variables. Statistics and Probability Letters, 25:241\u2013248, 1995.\n[16] E. Mammen and S. Van De Geer. Locally adaptive regression splines. Annals of Statistics, 1997.\n[17] P. Massart. A non-asymptotic theory for model selection. Pages 309\u2013323. European Mathematical Society, 2005.\n[18] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Preprint, 2006.\n[19] S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35, 2007.\n[20] J. Ruanaidh and W. Fitzgerald. Numerical Bayesian Methods Applied to Signal Processing. 
Statistics and Computing. Springer, 1996.\n[21] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n[22] R. Tibshirani and P. Wang. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1):18\u201329, 2008.\n[23] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2006.\n", "award": [], "sourceid": 700, "authors": [{"given_name": "C\u00e9line", "family_name": "Levy-leduc", "institution": null}, {"given_name": "Za\u00efd", "family_name": "Harchaoui", "institution": null}]}