{"title": "Sparsistent Learning of Varying-coefficient Models with Structural Changes", "book": "Advances in Neural Information Processing Systems", "page_first": 1006, "page_last": 1014, "abstract": "To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model --- piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models.", "full_text": "Sparsistent Learning of Varying-coef\ufb01cient Models\n\nwith Structural Changes\n\nMladen Kolar, Le Song and Eric P. Xing \u2217\n\nSchool of Computer Science, Carnegie Mellon University\n\n{mkolar,lesong,epxing}@cs.cmu.edu\n\nAbstract\n\nTo estimate the changing structure of a varying-coef\ufb01cient varying-structure\n(VCVS) model remains an important and open problem in dynamic system mod-\nelling, which includes learning trajectories of stock prices, or uncovering the\ntopology of an evolving gene network. 
In this paper, we investigate sparsistent learning of a sub-family of this model — piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies the covariates relevant to the response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to a brain-computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models.

1 Introduction
Consider the following regression model:

Yi = X′iβ(ti) + εi,  i = 1, . . . , n,  (1)

where the design variables Xi ∈ Rp are i.i.d. zero-mean random variables sampled at some conditions indexed by i = 1, . . . , n, such as the prices of a set of stocks at time i, or the signals from some sensors deployed at location i; the noise terms ε1, . . . , εn are i.i.d. Gaussian variables with variance σ², independent of the design variables; and β(ti) = (β1(ti), . . . , βp(ti))′ : [0, 1] → Rp is a vector of unknown coefficient functions. Since the coefficient vector is a function of the conditions rather than a constant, such a model is called a varying-coefficient model [12]. 
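As a concrete toy illustration (ours, not from the paper), data from the model in Eq. (1) can be simulated directly; the particular coefficient functions below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
t = np.arange(1, n + 1) / n  # observation times t_i = i/n in (0, 1]

# beta(t): one smoothly varying coefficient, one coefficient with a jump,
# and one irrelevant covariate (these functions are illustrative only).
beta = np.column_stack([np.sin(2 * np.pi * t),
                        0.5 * (t > 0.5),
                        np.zeros(n)])      # shape (n, p): row i is beta(t_i)

X = rng.standard_normal((n, p))            # i.i.d. zero-mean design
eps = rng.standard_normal(n)               # Gaussian noise
Y = np.sum(X * beta, axis=1) + eps         # Y_i = X_i' beta(t_i) + eps_i
```

Note that the third covariate never affects the response; recovering this kind of zero/non-zero pattern over time is the structure-estimation problem studied below.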
Varying-coefficient models are a non-parametric extension of linear regression models which, unlike other non-parametric models, assume that there is a linear relationship (generalizable to a log-linear relationship) between the feature variables and the output variable, albeit a changing one. The model given in Eq. (1) has the flexibility of a non-parametric model and the interpretability of an ordinary linear regression.

Varying-coefficient models were popularized in the work of [9] and [16]. Since then, they have been applied to a variety of domains, including multidimensional regression, longitudinal and functional data analysis, and modeling problems in econometrics and finance, to model and predict time- or space-varying responses to multidimensional inputs (see e.g. [12] for an overview). One can easily imagine a more general form of such a model applicable to these domains, where both the coefficient values and the model structure change with the values of other variables. We refer to this class of models as varying-coefficient varying-structure (VCVS) models. The more challenging problem of structure recovery (or model selection) under VCVS has started to attract attention only recently [1, 24].

∗LS is supported by a Ray and Stephenie Lane Research Fellowship. EPX is supported by grant ONR N000140910758, NSF DBI-0640543, NSF DBI-0546594, NSF IIS-0713379 and an Alfred P. Sloan Research Fellowship. 
We also thank Zaïd Harchaoui for useful discussions.

[Figure 1, panels (a)-(d): plots of the coefficient functions β1(t), β2(t), . . . , βp(t) over time t = i/n.]

Figure 1: (a) Illustration of a VCVS model as varying functions of time. The interval [0, 1] is partitioned into {0, 0.25, 0.4, 0.7, 1}, which defines blocks on which the coefficient functions are constant. At different blocks only covariates with non-zero coefficients affect the response, e.g. on the interval B2 = (0.25, 0.4) covariates X2 and Xp do not affect the response. (b) Schematic representation of the covariates affecting the response during the second block in panel (a), which is reminiscent of neighborhood selection in graph structure learning. (c) and (d) Application of VCVS for graph structure estimation (see Section 7) of non-piecewise-constant evolving graphs. Coefficients defining the neighborhoods of different nodes can change on different partitions.

In this paper, we analyze VCVS as functions of time, and the main goal is to estimate the dynamic structure and jump points of the unknown vector function β(t). To be more specific, we consider the case where the function β(t) is time-varying, but piecewise constant (see Fig. 1), i.e., there exists a partition T = {T1 = 0 < T2 < . . . 
< TB = 1}, 1 < B ≤ n, of the time interval (scaled to) [0, 1], such that β(t) = γj, t ∈ [Tj−1, Tj), for some constant vectors γj ∈ Rp, j = 1, . . . , B. We refer to the points T1, . . . , TB as jump points. Furthermore, we assume that at each time point ti only a few covariates affect the response, i.e., the vector β(ti) is sparse. A good estimation procedure should identify the correct partition of the interval [0, 1], so that within each segment the coefficient function is constant, and in addition identify the active coefficients and their values within each segment, i.e., the time-varying structure of the model. This estimation problem is particularly important in applications where one needs to uncover dynamic relational information or model structures from time series data. For example, one may want to infer, at chosen time points, the (changing) set of stocks that are predictive of a particular stock one has been holding, from a time series of all stock prices; or to understand the evolving circuitry of gene regulation at different growth stages of an organism, which determines the activity of a target gene based on other regulatory genes, from time series of microarray data. Identifying structural changes is also important in fields such as signal processing, EEG segmentation and the analysis of seismic signals. In all these problems, the goal is not to estimate the optimal value of β(t) for predicting Y, but to consistently uncover the zero and non-zero patterns in β(t) at time points of interest, which reveal the changing structure of the model. 
In this paper, we provide a new algorithm to achieve this goal, and a theoretical analysis that proves the asymptotic consistency of our algorithm.

Our problem is remotely related to, but very different from, earlier work on linear regression models with structural changes [4], and the problem of change-point detection (e.g. [19]), which can also be analyzed in the framework of varying-coefficient models. A number of existing methods are available to identify only one structural change in the data; in order to identify multiple changes, these methods can be applied sequentially on smaller intervals that are assumed to harbor only one change [14]. Another common approach is to assume that there are K changes and use dynamic programming to estimate them [4]. In this paper, we propose and analyze a penalized least squares approach, which automatically adapts to the unknown number of structural changes present in the data and performs variable selection on each of the constant regions.

2 Preliminaries

For a varying-coefficient regression model described in Eq. (1) with structural changes, a reasonable estimator of the time-varying structure can be obtained by minimizing the so-called TESLA (temporally smoothed L1-regularized regression) loss proposed in [1]: (for simplicity we suppress the sample-size notation n in the regularization constants λn = {λn1, λn2}, but it should be clear that their values depend on n)

ˆβ(t1; λ), . . . , ˆβ(tn; λ) = argmin_β sum_{i=1}^n (Yi − X′iβ(ti))² + 2λ1 sum_{i=1}^n ||β(ti)||1 + 2λ2 sum_{k=1}^p ||βk||TV,  (2)

where ||·||1 denotes the ℓ1 norm, and ||·||TV denotes the total variation norm: ||βk||TV = sum_{i=2}^n |βk(ti) − βk(ti−1)|. From the analysis of [20], it is known that each component function βk can be chosen as a piecewise constant and right continuous function, i.e., βk is a spline function, with potential jump points at the observation times ti, i = 1, . . . , n. In this particular case, the total variation penalty defined above allows us to conceptualize βk not as a function [0, 1] → R, but as a vector in Rn whose components βk,i ≡ βk(ti) correspond to the function values at ti, i = 1, . . . , n. We continue to use the vector representation throughout the rest of the paper, as it simplifies the notation.

The estimation problem defined in Eq. (2) has a few appealing properties. The objective function on the right-hand side is convex and there exists a solution ˆβ, which can be found efficiently using a standard convex optimization package. Furthermore, the penalty terms in Eq. (2) are constructed in a way that performs model selection. Observe that the ℓ1 penalty encourages sparsity of the signal at each time point and enables a selection over the relevant coefficients, whereas the total variation penalty is used to partition the interval [0, 1] so that ˆβk is constant within each segment. However, there are also some drawbacks to the procedure, as shown in Lemma 1 below.

Let us start with some notational clarifications. Let X denote the design matrix; the input observation Xi at time i corresponds to the i-th row of X. For simplicity, we assume throughout the paper that the columns of X are normalized to have unit length, i.e., each dimension has unit Euclidean norm. Let Bj, j = 1, . . . , B, denote the set of time points that fall into the interval [Tj−1, Tj); when the meaning is clear from the context, we also use Bj as a shorthand for this interval. 
For example, XBj and YBj represent the submatrix of X and the subvector of Y, respectively, that include only the elements corresponding to time points within the interval Bj. For a given solution ˆβ to Eq. (2), there exists a block partition ˆT = { ˆT1, . . . , ˆT ˆB} of [0, 1] (possibly a trivial one) and unique vectors ˆγj ∈ Rp, j = 1, . . . , ˆB, such that ˆβk,i = ˆγj,k for ti ∈ ˆBj. The set of relevant covariates during interval Bj, i.e., the support of the vector γj, is denoted SBj = {k | γj,k ≠ 0}; the support ˆS ˆBj of ˆγj is defined likewise. By construction, no consecutive vectors ˆγj and ˆγj+1 are identical. Note that both the number of partitions ˆB = | ˆT |, and the elements of the partition ˆT, are random quantities. The following lemma characterizes the vectors ˆγj using the subgradient equation of Eq. (2).

Lemma 1 Let ˆγj and ˆBj, j = 1, . . . , ˆB, be the vectors and segments obtained from a minimizer of Eq. (2). 
Then each ˆγj can be found as a solution, over ˆγj, of the subgradient equation

X′_{ˆBj} X_{ˆBj} ˆγj − X′_{ˆBj} Y_{ˆBj} + λ1 | ˆBj| ˆs(1)_j + λ2 ˆs(TV)_j = 0,  (3)

where ˆs(1)_j ∈ ∂||ˆγj||1 = sign(ˆγj), with the convention sign(0) ∈ [−1, 1], and ˆs(TV)_j ∈ Rp is such that

ˆs(TV)_{1,k} = −1 if ˆγ2,k − ˆγ1,k > 0, and 1 if ˆγ2,k − ˆγ1,k < 0,  (4)

ˆs(TV)_{ ˆB,k} = 1 if ˆγ ˆB,k − ˆγ ˆB−1,k > 0, and −1 if ˆγ ˆB,k − ˆγ ˆB−1,k < 0,  (5)

and, for 1 < j < ˆB,

ˆs(TV)_{j,k} = 2 if ˆγj+1,k − ˆγj,k > 0 and ˆγj,k − ˆγj−1,k < 0; −2 if ˆγj+1,k − ˆγj,k < 0 and ˆγj,k − ˆγj−1,k > 0; 0 if sign(ˆγj,k − ˆγj−1,k) sign(ˆγj+1,k − ˆγj,k) = 1.  (6)

Lemma 1 does not provide a practical way to compute the estimator of Eq. (2), but it does characterize a solution. From Eq. (3) we can see that the coefficients in each of the estimated blocks are biased by two terms, coming from the ℓ1 and ||·||TV penalties. The larger the estimated segments, the smaller the relative influence of the bias from the total variation penalty, while the magnitude of the bias introduced by the ℓ1 penalty is uniform across segments. The additional bias coming from the total variation penalty was also noted in the problem of signal denoising [23]. 
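The ℓ1 part of this bias is easy to see in the simplest possible setting (our own illustration, not from the paper): when the design has orthonormal columns, the minimizer of ||Y − Xβ||² + 2λ||β||1 is the soft-thresholding of the least-squares solution, so every retained coefficient is pulled toward zero by exactly λ:

```python
import numpy as np

def soft_threshold(z, lam):
    """Entrywise S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(1)
n, p, lam = 50, 5, 0.8
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthonormal columns: X'X = I
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 3.0])
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

z = X.T @ Y                        # least-squares solution when X'X = I
beta_hat = soft_threshold(z, lam)  # lasso solution in the orthonormal case

def lasso_objective(b):
    return np.sum((Y - X @ b) ** 2) + 2 * lam * np.sum(np.abs(b))
```

Every nonzero entry of beta_hat differs from the unpenalized solution z by exactly lam in absolute value; this is the uniform ℓ1 bias discussed above, on top of which the TV penalty adds a second, segment-length-dependent bias term.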
In the next section, we introduce a two-step procedure which alleviates this effect.

3 A two-step procedure for estimating time-varying structures

In this section, we propose a new algorithm for estimating the time-varying structure of the varying-coefficient model in Eq. (1), which does not suffer from the bias introduced by minimizing the objective in Eq. (2). The algorithm is a two-step procedure summarized as follows:

1. Estimate the block partition ˆT, on which the coefficient vector is constant within each block. This can be obtained by minimizing the following objective:

sum_{i=1}^n (Yi − X′iβ(ti))² + 2λ2 sum_{k=1}^p ||βk||TV,  (7)

which we refer to as temporal difference (TD) regression, for reasons that will become clear shortly. We will apply a TD-transformation to Eq. (7) to turn it into an ℓ1-regularized regression problem, and solve it using the randomized Lasso. Details of the algorithm and of how to extract ˆT from the TD-estimate are given below.

2. For each block of the partition, ˆBj, 1 ≤ j ≤ ˆB, estimate ˆγj by minimizing the Lasso objective within the block:

ˆγj = argmin_{γ∈Rp} sum_{ti∈ ˆBj} (Yi − X′iγ)² + 2λ1 ||γ||1.  (8)

We name this procedure TDB-Lasso (or TDBL), after the two steps (TD randomized Lasso, and Lasso within Blocks) given above. The advantage of the TDB-Lasso over a minimizer of Eq. (2) comes from decoupling the interactions between the ℓ1 and TV penalties (note that the two procedures result in different estimates). We now discuss step 1 in detail; step 2 is straightforward using a standard Lasso toolbox.

To obtain a consistent estimate of ˆT from the TD-regression in Eq. (7), we can transform Eq. 
(7) into an equivalent ℓ1-penalized regression problem, which allows us to cast the estimation of ˆT as a feature selection problem. Let β†k,i denote the temporal difference between the regression coefficients corresponding to the same covariate k at successive time points ti−1 and ti: β†k,i ≡ βk(ti) − βk(ti−1), k = 1, . . . , p, i = 1, . . . , n, with βk(t0) = 0 by convention. It can be shown that the model in Eq. (1) can be expressed as Y† = X†β† + ε†, where Y† ∈ Rn is the transformed vector of TDs of the responses, i.e., each element Y†i ≡ Yi − Yi−1; X† = (X†1, . . . , X†p) ∈ Rn×np is the transformed design matrix, with lower triangular matrices X†k ∈ Rn×n corresponding to the TD features computed from the covariates; ε† ∈ Rn is the transformed TD-error vector; and β† ∈ Rnp is the vector obtained by stacking the TD-coefficient vectors β†k. (See Appendix for more details of the transformation.) Note that the elements of the vector ε† are no longer i.i.d. Using the transformation above, the estimation problem defined by the objective in Eq. (7) can be expressed in the following matrix form:

ˆβ† = argmin_{β†∈Rnp} ||Y† − X†β†||2² + 2λ2 ||β†||1.  (9)

This transformation was proposed in [8] in the context of one-dimensional signal denoising; here, however, we are interested in the estimation of jump points in the context of a time-varying coefficient model.

The estimator defined in Eq. 
(9) is not robust with respect to small perturbations of the data, i.e., small changes in the variables Xi or Yi may result in a different ˆT. To deal with this problem, we employ the stability selection procedure of [22] (see also the bootstrap Lasso [2]; we use stability selection because of its weaker assumptions). The stability selection approach to estimating the jump points comprises two main components: i) simulating multiple datasets using the bootstrap, and ii) using the randomized Lasso outlined in Algorithm 1 (see also Appendix) to solve (9). While the bootstrap step improves the robustness of the estimator, the randomized Lasso weakens the conditions under which the estimator ˆβ† selects exactly the true features.

Let { ˆβ†b, ˆJ†b}, b = 1, . . . , M, represent the set of estimates and their supports (i.e., the indices of non-zero elements) obtained by minimizing (9) on each of the M bootstrapped datasets. We obtain a stable estimate of the support by selecting the variables that appear in multiple supports,

ˆJτ = {k | (1/M) sum_{b=1}^M 1I{k ∈ ˆJ†b} ≥ τ},  (10)

which is then used to obtain the block partition estimate ˆT. 
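The stability-selection step can be sketched as follows (our own simplified illustration: a basic coordinate-descent Lasso stands in for the solver, a generic sparse regression plays the role of the TD-transformed problem (9), and all function names and parameter values are ours):

```python
import numpy as np

def lasso_cd(X, Y, lam, weights=None, n_iter=100):
    """Coordinate descent for sum_i (Y_i - X_i'b)^2 + 2*lam*sum_k |b_k|/W_k."""
    n, p = X.shape
    W = np.ones(p) if weights is None else weights
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for k in range(p):
            r = Y - X @ b + X[:, k] * b[k]       # residual excluding feature k
            z = X[:, k] @ r
            b[k] = np.sign(z) * max(abs(z) - lam / W[k], 0.0) / col_sq[k]
    return b

def randomized_lasso(X, Y, lam, alpha, rng):
    """Algorithm 1: random penalty weights W_k drawn from [alpha, 1]."""
    W = rng.uniform(alpha, 1.0, size=X.shape[1])
    return np.flatnonzero(np.abs(lasso_cd(X, Y, lam, weights=W)) > 1e-8)

def stability_selection(X, Y, lam, alpha=0.6, M=20, tau=0.7, seed=0):
    """Eq. (10): keep features selected in a fraction >= tau of bootstrap runs."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(M):
        idx = rng.integers(0, n, size=n)         # bootstrap resample
        counts[randomized_lasso(X[idx], Y[idx], lam, alpha, rng)] += 1
    return np.flatnonzero(counts / M >= tau)
```

On a toy sparse regression the true features are selected in essentially every bootstrap run, while the selection frequencies of the irrelevant features stay well below the cutoff τ.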
The parameter τ is a tuning parameter that controls the number of falsely identified jump points.

Algorithm 1 Randomized Lasso
Input: Dataset {Xi, Yi}, i = 1, . . . , n, with Xi ∈ Rp; penalty parameter λ; weakness parameter α ∈ (0, 1]
Output: Estimate ˆβ ∈ Rp, support ˆS
1: Choose p weights {Wk}, k = 1, . . . , p, uniformly at random from the interval [α, 1]
2: ˆβ = argmin_{β∈Rp} sum_{i=1}^n (Yi − X′iβ)² + 2λ sum_{k=1}^p |βk| / Wk
3: ˆS = {k | ˆβk ≠ 0}

4 Theoretical analysis
We provide a theoretical analysis of the TDB-Lasso, and show that under certain conditions both the jump points and the structure of the VCVS model can be consistently estimated. Proofs are deferred to the Appendix.

4.1 Estimating jump points

We first address the issue of estimating jump points by analyzing the transformed TD-regression problem in Eq. (9) and its feature selection properties. Feature selection using ℓ1 penalization has been analyzed intensively over the past few years, and we can adapt some of the existing results to the problem at hand. To prove that all the jump points are included in ˆJτ, we first state a sparse eigenvalue condition on the design (e.g. [6]). The minimal and maximal sparse eigenvalues of a matrix X ∈ Rn×p are defined as

φmin(k, X) := inf_{a∈Rp, ||a||0≤k} ||Xa||2 / ||a||2,  φmax(k, X) := sup_{a∈Rp, ||a||0≤k} ||Xa||2 / ||a||2,  k ≤ p.  (11)

Note that in Eq. (11) the eigenvalues are computed over submatrices of size k (i.e., due to the constraint on a imposed by the ||·||0 norm). We can now state the sparse eigenvalue condition on the design.

A1: Let J† be the true support of β† and J = |J†|. 
There exist some C > 1 and κ ≥ 10 such that

φmax(CJ², X†) / φmin(CJ², X†)^{3/2} < √C / κ.  (12)

This condition guarantees a correlation structure between the TD-transformed covariates that allows for detection of the jump points. Compared to the irrepresentable condition [30, 21, 27], necessary for the ordinary Lasso to perform feature selection, condition A1 is much weaker [22] and is sufficient for the randomized Lasso to select the relevant features with high probability (see also [26]).

Theorem 1 Let A1 be satisfied, and let the weakness α be given as α² = νφmin(CJ², X†)/(CJ²), for any ν ∈ (7/κ, 1/√2). If the minimum size of a jump is bounded away from zero as

min_{k∈J†} |β†k| ≥ 0.3(CJ)^{3/2} λmin,  (13)

where λmin = 2σ†(√CJ + 1)√(log(np)/n) and σ†² ≥ Var(Y†i), then for np > 10 and J ≥ 7 there exists some δ = δJ ∈ (0, 1) such that for all τ ≥ 1 − δ the collection of estimated jump points ˆJτ satisfies

P( ˆJτ = J†) ≥ 1 − 5/np.  (14)

Remark: Note that Theorem 1 gives conditions under which we can recover every jump point in every covariate. In particular, there are no assumptions on the number of covariates that change values at a jump point. Assuming that multiple covariates change their values at a jump point, we could further relax the condition on the minimal size of a jump given in Eq. (13). It was also pointed out to us that the framework of [18] may be a more natural way to estimate jump points.

4.2 Identifying correct covariates

We now address the issue of selecting the relevant features in every estimated segment. Under the conditions of Theorem 1, the correct jump points will be detected with probability arbitrarily close to 1. 
That means that under assumption A1 we can run the regular Lasso on each of the estimated segments to select the relevant features therein. We will assume that the mutual coherence condition [10] holds for each segment Bj. Let Σj = (1/|Bj|) sum_{i∈Bj} Xi X′i, with σj_kl = (Σj)k,l.

A2: We assume there is a constant 0 < d ≤ 1 such that

P( max_{k∈SBj, l≠k} |σj_kl| ≤ d / |SBj| ) = 1.  (15)

Assumption A2 is a mild version of the mutual coherence condition used in [7], which is necessary for identification of the relevant covariates in each segment. Let ˆγj, j = 1, . . . , ˆB, denote the Lasso estimates for each segment obtained by minimizing (8).

Theorem 2 Let A2 be satisfied, and assume that the conditions of Theorem 1 are satisfied. Let K = max_{1≤j≤B} ||γj||0 be an upper bound on the number of features in the segments, and let L be an upper bound on the elements of X. Let ρ = min_{1≤j≤B} |Bj| denote the number of samples in the smallest segment. Then if, for a sequence δ = δn → 0,

λ1 ≥ 4Lσ √(ln(2Kp/δ)/ρ) ∨ 8L ln(4Kp/δ)/ρ  (16)

and

min_{1≤j≤B} min_{k∈SBj} |γj,k| ≥ 2λ1,  (17)

we have

lim_{n→∞} P( ˆB = B) = 1,  lim_{n→∞} max_{1≤j≤B} P(||ˆγj − γj||1 = 0) = 1,  lim_{n→∞} min_{1≤j≤B} P( ˆSBj = SBj) = 1.  (18)

Theorem 2 states that, asymptotically, the two-stage procedure estimates the correct model, i.e., it selects the correct jump points and, for each segment between two jump points, it is able to select the correct covariates. 
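In code, the second stage is just an ordinary Lasso per estimated block, as in Eq. (8) (our sketch; a small coordinate-descent solver stands in for a library routine, and the block boundaries are assumed to come from stage one):

```python
import numpy as np

def lasso_cd(X, Y, lam, n_iter=200):
    """Coordinate descent for sum_i (Y_i - X_i'g)^2 + 2*lam*||g||_1."""
    n, p = X.shape
    g = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for k in range(p):
            z = X[:, k] @ (Y - X @ g + X[:, k] * g[k])
            g[k] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[k]
    return g

def fit_blocks(X, Y, boundaries, lam):
    """Eq. (8): fit an independent Lasso on each estimated block.
    boundaries = [0, j_1, ..., n] are the sample indices of the jump points."""
    gammas, supports = [], []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        g = lasso_cd(X[lo:hi], Y[lo:hi], lam)
        gammas.append(g)
        supports.append(set(np.flatnonzero(np.abs(g) > 1e-8)))
    return gammas, supports
```

Because each block is fitted separately, the TV-induced bias of Eq. (2) disappears and only the usual Lasso shrinkage remains.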
Furthermore, we can conclude that the procedure is consistent.\n\n5 Practical considerations\nAs in standard Lasso, the regularization parameters in TDB-Lasso need to be tuned appropriately\nto attain correct structural recovery. The TD regression procedure requires three parameters: the\npenalty parameter \u03bb2, cut-off parameter \u03c4 , and weakness parameter \u03b1. From our empirical experi-\nence, the recovered set of jump points \u02c6T vary very little with respect to these parameters in a wide\nrange. The result of Theorem 1 is valid as long as \u03bb2 is larger than \u03bbmin given in the statement\nof the theorem. Theorem 1 in [22] gives a way to select the cutoff \u03c4 while controlling the number\nof falsely included jump points. Note that this relieves users from carefully choosing the range of\nparameter \u03bb2, which is challenging. The weakness parameter can be chosen in quite a large interval\n(see Appendix on the randomized Lasso) and we report our results using the values \u03b1 = 0.6.\nIn the second step of the algorithm, the ordinary Lasso minimizes Eq. (8) on each estimated segment\nto select relevant variables, which requires a choice of the penalty parameter \u03bb1. We do so by\nminimizing the BIC criterion [25].\n\nIn practice, one cannot verify assumptions A1 and A2 on real datasets. In cases where the assump-\ntions are violated, the resulting set of estimated jump points is larger than the true set T , e.g.\nthe\npoints close to the true jump points get included into the resulting estimate \u02c6T . We propose to use\nan ad hoc heuristic to re\ufb01ne the initially selected set of jump points. A commonly used procedure\nfor estimation of linear regression models with structural changes [3] is a dynamic programming\nmethod that considers a possible structural change at every location ti, i = 1, . . . , n, with a compu-\ntational complexity of O(n2) (see also [15]). 
We modify this method to consider jump points only in the estimated set ˆT, thus considerably reducing the computational complexity to O(| ˆT |²), since | ˆT | ≪ n. The algorithm effectively chooses a subset ˜T ⊆ ˆT of size ˆB that minimizes the BIC objective.

6 Experiments on Synthetic Data
We compared the TDB-Lasso on synthetic data with commonly used methods for estimating VCVS models. The synthetic data was generated as follows. We varied the sample size from n = 100 to 500 time points, and fixed the number of covariates to p = 20. The block partition was generated randomly and consists of ten blocks, with the minimum length set to 10 time points. In each of the blocks, only 5 covariates out of 20 affected the response. Their values were drawn uniformly at random from [−1, −0.1] ∪ [0.1, 1]. With this configuration, a dataset was created by randomly drawing Xi ∼ N(0, Ip), εi ∼ N(0, 1.5²) and computing Yi = X′iβ(ti) + εi for i = 1, . . . , n. For each sample size, we independently generated 100 datasets and report results averaged over them.

[Figure 2: Comparison results of different estimation procedures on the synthetic dataset; panels show REE, percentage of correct zeros, precision, recall and F1 as functions of the sample size, for "Kernel ℓ1/ℓ2", "Kernel ℓ1", "ℓ1 + TV" and the TDB-Lasso.]

A simple local regression method [13], which is commonly used for estimation in varying-coefficient models, served as the simplest baseline for comparing the relative estimation performance. 
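The generation protocol just described can be reproduced in a few lines (our sketch; the random block partition is drawn by sampling a composition of the slack above the minimum block length):

```python
import numpy as np

def random_partition(n, n_blocks, min_len, rng):
    """Random block lengths >= min_len summing to n."""
    extra = n - n_blocks * min_len              # slack to distribute
    cuts = np.sort(rng.integers(0, extra + 1, size=n_blocks - 1))
    parts = np.diff(np.concatenate(([0], cuts, [extra])))
    lengths = min_len + parts                   # one length per block
    return np.concatenate(([0], np.cumsum(lengths)))

def make_dataset(n, p=20, n_blocks=10, k_active=5, sigma=1.5, seed=0):
    """Piecewise constant VCVS data as in the synthetic experiment."""
    rng = np.random.default_rng(seed)
    bounds = random_partition(n, n_blocks, min_len=10, rng=rng)
    beta = np.zeros((n, p))
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        support = rng.choice(p, size=k_active, replace=False)
        vals = rng.uniform(0.1, 1.0, size=k_active) * rng.choice([-1.0, 1.0], k_active)
        beta[lo:hi, support] = vals             # magnitudes in [0.1, 1]
    X = rng.standard_normal((n, p))             # X_i ~ N(0, I_p)
    Y = np.sum(X * beta, axis=1) + sigma * rng.standard_normal(n)
    return X, Y, beta, bounds
```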
Our first competitor is an extension of the baseline, which uses the following estimator [28]:

min_{β∈Rp×n} sum_{i′=1}^n sum_{i=1}^n (Yi − X′iβi′)² Kh(ti′ − ti) + sum_{j=1}^p λj √( sum_{i′=1}^n β²i′,j ),  (19)

where Kh(·) = (1/h) K(·/h) is the kernel function. We call this method "Kernel ℓ1/ℓ2". Another competitor uses ℓ1-penalized local regression independently at each time point, which leads to the following estimator of β(t):

min_{β∈Rp} sum_{i=1}^n (Yi − X′iβ)² Kh(ti − t) + sum_{j=1}^p λj |βj|.  (20)

We call this method "Kernel ℓ1". The difference between the two methods is that "Kernel ℓ1/ℓ2" biases certain covariates toward zero at every time point, based on global information, whereas "Kernel ℓ1" biases covariates toward zero based only on local information. The final competitor is the minimizer of Eq. (2) [1], which we call "ℓ1 + TV". The bandwidth parameter for "Kernel ℓ1" and "Kernel ℓ1/ℓ2" is chosen using generalized cross validation of a non-penalized estimator. The penalty parameters λj are chosen according to the BIC criterion [28]. For the "ℓ1 + TV" method, we optimize the BIC criterion over a two-dimensional grid of values for λ1 and λ2.

We report the relative estimation error, REE = 100 × sum_{i=1}^n sum_{j=1}^p | ˆβi,j − β∗i,j| / sum_{i=1}^n sum_{j=1}^p | ˜βi,j − β∗i,j|, where ˜β is the baseline local linear estimator, as a measure of estimation accuracy. 
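The error and model-selection metrics reported in Fig. 2 can be computed directly from the estimated, baseline and true coefficient matrices (our sketch; beta_hat, beta_tilde and beta_star are arrays with one row per time point):

```python
import numpy as np

def ree(beta_hat, beta_tilde, beta_star):
    """REE = 100 * sum |beta_hat - beta*| / sum |beta_tilde - beta*|."""
    return 100.0 * (np.sum(np.abs(beta_hat - beta_star))
                    / np.sum(np.abs(beta_tilde - beta_star)))

def selection_metrics(beta_hat, beta_star, tol=1e-8):
    """Precision/recall/F1 of the estimated support over all time points,
    plus the fraction of correctly identified irrelevant (zero) covariates."""
    est = np.abs(beta_hat) > tol
    true = np.abs(beta_star) > tol
    tp = np.sum(est & true)
    precision = tp / max(np.sum(est), 1)
    recall = tp / max(np.sum(true), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    correct_zeros = np.sum(~est & ~true) / max(np.sum(~true), 1)
    return precision, recall, f1, correct_zeros
```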
To assess the performance of the model selection, we report the precision, recall and their harmonic mean (the F1 measure) for estimating the relevant covariates at each time point, together with the percentage of correctly identified irrelevant covariates.

From the experimental results, summarized in Fig. 2, we can see that the TDB-Lasso succeeds in recovering the true model as the sample size increases. It also estimates the coefficient values with better accuracy than the other methods. It is worth noting that "Kernel ℓ1" performs better than the "Kernel ℓ1/ℓ2" approach, which is due to the violation of the assumptions made in [28]. The "ℓ1 + TV" method performs better than the local linear regression approaches; however, it becomes very slow for larger sample sizes and requires selecting two tuning parameters, which makes it quite difficult to use. We conjecture that "ℓ1 + TV" and the TDB-Lasso have similar asymptotic properties with respect to model selection; however, our numerical experiments show that on finite samples the TDB-Lasso performs better.

7 Application to Time-varying Graph Structure Estimation
An interesting application of the TDB-Lasso is the structural estimation of time-varying undirected graphical models [1, 17]. Graph structure estimation can be posed as a neighborhood selection problem, in which the neighbors of each node are estimated independently. Neighborhood selection in time-varying Gaussian graphical models (GGMs) is equivalent to model selection in a VCVS model, where the value of one node is regressed on the rest of the nodes. The regression problem for each node can be solved using the TDB-Lasso. 
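Concretely, the reduction runs one sparse regression per node on each segment and then symmetrizes the estimated neighborhoods into a graph (our sketch: a plain per-segment Lasso stands in for the full TDB-Lasso, the segments are taken as given, and the OR rule used to symmetrize is our choice):

```python
import numpy as np

def lasso_cd(X, Y, lam, n_iter=100):
    """Coordinate descent for sum_i (Y_i - X_i'b)^2 + 2*lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for k in range(p):
            z = X[:, k] @ (Y - X @ b + X[:, k] * b[k])
            b[k] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[k]
    return b

def estimate_graphs(Z, boundaries, lam):
    """Neighborhood selection per segment: regress each node on the others,
    then connect j and k if either regression selects the other (OR rule)."""
    n, d = Z.shape
    graphs = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        adj = np.zeros((d, d), dtype=bool)
        for j in range(d):
            others = [k for k in range(d) if k != j]
            b = lasso_cd(Z[lo:hi, others], Z[lo:hi, j], lam)
            for k, bk in zip(others, b):
                if abs(bk) > 1e-8:
                    adj[j, k] = adj[k, j] = True
        graphs.append(adj)
    return graphs
```

Each returned adjacency matrix is constant on its segment, while the sequence of matrices captures the evolving graph.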
Graphs estimated in this way will have neighborhoods of each node that are constant on a partition, but the graph as a whole changes more flexibly (Fig. 1b-d).

[Figure 3: Brain interactions for the subject 'aa' when presented with visual cues of class 1 (right hand), shown at t = 1.00s, t = 2.00s and t = 3.00s.]

The graph structure estimation using the TDB-Lasso is demonstrated on a real dataset of electroencephalogram (EEG) measurements. We use the brain computer interface (BCI) dataset IVa from [11], in which EEG data is collected from 5 subjects, who were given visual cues based on which they were required to imagine the right hand or the right foot for 3.5s. The measurement was performed while the visual cues were presented on the screen (280 times), intermitted by periods of random length in which the subject could relax. We use the data down-sampled at 100Hz. Fig. 3 visualizes the brain interactions over the time of the experiment for the subject 'aa' while presented with visual cues for class 1 (right hand). Estimated graphs of interactions between different parts of the brain for the other subjects and classes are given in the Appendix due to the space limit.

We also study whether the estimated time-varying networks are discriminative features for classifying the type of imagination in the EEG signal. For this purpose, we perform unsupervised clustering of the EEG signals using the time-varying networks and study whether the grouping corresponds to the true grouping according to the imagination label. We estimate a time-varying GGM using the TDB-Lasso for each visual cue and cluster the graphs using spectral K-means clustering [29] (with a linear kernel on the coefficients to measure similarity). Each cluster is labeled according to the majority of points it contains. Finally, each cue is classified based on the labels of the time points that it contains.
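The clustering step can be sketched as follows. This is our own minimal reading of the pipeline, not the authors' implementation: the spectral relaxation of K-means [29] applied to the linear-kernel Gram matrix of the flattened coefficient vectors, followed by Lloyd's K-means on the spectral embedding (a deterministic farthest-point initialization is our addition, for reproducibility).

```python
import numpy as np

def spectral_kmeans(coefs, K, n_iter=50):
    """Spectral relaxation of K-means [29]: embed each network (one row
    of flattened coefficients) via the top-K eigenvectors of the linear
    kernel matrix, then run Lloyd's K-means on the embedding."""
    G = coefs @ coefs.T                      # linear kernel on coefficients
    _, vecs = np.linalg.eigh(G)              # eigenvalues in ascending order
    U = vecs[:, -K:]                         # top-K eigenvector embedding
    # deterministic farthest-point initialization of the K centers
    centers = U[:1].copy()
    for _ in range(K - 1):
        d = ((U[:, None, :] - centers[None]) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, U[d.argmax()]])
    labels = np.zeros(len(U), dtype=int)
    for _ in range(n_iter):                  # Lloyd iterations
        d = ((U[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = U[labels == k].mean(axis=0)
    return labels
```

In the experiment, each row of `coefs` would hold the stacked TDB-Lasso coefficients for one time point; a cue is then classified by majority vote over the cluster labels of its time points.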
Table 1 summarizes the classification accuracy for each subject based on K = 4 clusters (K was chosen as the cutoff point beyond which there was little decrease in the K-means objective). We compare this approach to the case in which GGMs with a static structure are estimated [5]. Note that supervised classifiers with specialized EEG features are able to achieve much higher classification accuracy; however, our approach does not use any labeled data and can be seen as an exploratory step. We also used the TDB-Lasso for estimating time-varying gene networks from microarray time series data, but due to the space limit, those results will be reported in a separate biological paper.

Table 1: Classification accuracies based on learned brain interactions.

Subject       aa     al     av     aw     ay
TDB-Lasso    0.69   0.80   0.59   0.67   0.83
Static       0.58   0.63   0.54   0.57   0.61

8 Discussion

We have developed the TDB-Lasso procedure, a novel approach for model selection and variable estimation in varying-coefficient varying-structure models with piecewise constant functions. The VCVS models form a flexible nonparametric class of models that retains the interpretability of parametric models. Due to their flexibility, important classical problems, such as linear regression with structural changes and change point detection, as well as some more recent problems, like structure estimation of varying graphical models, can be modeled within this class. The TDB-Lasso compares favorably to other commonly used [28] and more recent [1] techniques for estimation in this class of models, as demonstrated on the synthetic data. The model selection properties of the TDB-Lasso, demonstrated on the synthetic data, are also supported by the theoretical analysis.
Furthermore, we demonstrate a way of applying the TDB-Lasso to graph estimation on a real dataset.

Application of the TDB-Lasso procedure goes beyond linear varying-coefficient regression models. A direct extension is to generalized varying-coefficient models g(m(X_i, t_i)) = X_i'β(t_i), i = 1, . . . , n, where g(·) is a given link function and m(X_i, t_i) = E[Y | X = X_i, t = t_i] is the conditional mean. Estimation in generalized varying-coefficient models proceeds by changing the squared loss in Eq. (7) and Eq. (8) to a different appropriate loss function. The generalized varying-coefficient models can be used to estimate the time-varying structure of discrete Markov random fields, again by performing neighborhood selection.

References
[1] Amr Ahmed and Eric P. Xing. Tesla: Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences, 2009.
[2] Francis R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, ICML, volume 307 of ACM International Conference Proceeding Series, pages 33–40. ACM, 2008.
[3] J. Bai and P. Perron. Computation and analysis of multiple structural change models. Journal of Applied Econometrics, 18:1–22, 2003.
[4] Jushan Bai and Pierre Perron. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47–78, January 1998.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. J. Mach. Learn. Res., 9:485–516, 2008.
[6] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Ann. of Stat.
[7] Florentina Bunea. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization.
Electronic Journal of Statistics, 2:1153, 2008.
[8] Scott S. Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999.
[9] William S. Cleveland, Eric Grosse, and William M. Shyu. Local regression models. In John M. Chambers and Trevor J. Hastie, editors, Statistical Models in S, pages 309–376, 1991.
[10] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory, 52:6–18, 2006.
[11] G. Dornhege, B. Blankertz, G. Curio, and K. Müller. Boosting bit rates in non-invasive EEG single-trial classifications by feature combination and multi-class paradigms. IEEE Trans. Biomed. Eng., 51:993–1002, 2004.
[12] Jianqing Fan and Qiwei Yao. Nonlinear Time Series: Nonparametric and Parametric Methods (Springer Series in Statistics). Springer, August 2005.
[13] Jianqing Fan and Wenyang Zhang. Statistical estimation in varying-coefficient models. The Annals of Statistics, 27:1491–1518, 2000.
[14] Zaïd Harchaoui, Francis Bach, and Éric Moulines. Kernel change-point analysis. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, 2009.
[15] Zaïd Harchaoui and Céline Levy-Leduc. Catching change-points with lasso. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 617–624. MIT Press, Cambridge, MA, 2008.
[16] Trevor Hastie and Robert Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society, Series B (Methodological), 55(4):757–796, 1993.
[17] Mladen Kolar, Le Song, and Eric Xing. Estimating time-varying networks. arXiv:0812.5087, 2008.
[18] Marc Lavielle and Eric Moulines.
Least-squares estimation of an unknown number of shifts in a time series. Journal of Time Series Analysis, 21(1):33–59, 2000.
[19] E. Lebarbier. Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Process., 85(4):717–736, 2005.
[20] E. Mammen and S. van de Geer. Locally adaptive regression splines. Ann. of Stat., 25(1):387–413, 1997.
[21] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436, 2006.
[22] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Preprint, 2008.
[23] Alessandro Rinaldo. Properties and refinements of the fused lasso. Preprint, 2008.
[24] Le Song, Mladen Kolar, and Eric P. Xing. Keller: Estimating time-evolving interactions between genes. In Proceedings of the 16th International Conference on Intelligent Systems for Molecular Biology, 2009.
[25] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67(1):91–108, 2005.
[26] S. A. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the lasso, 2009.
[27] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. Preprint, 2006.
[28] H. Wang and Y. Xia. Shrinkage estimation of the varying coefficient model. Manuscript, 2008.
[29] H. Zha, C. Ding, M. Gu, X. He, and H. Simon. Spectral relaxation for K-means clustering. Pages 1057–1064. MIT Press, 2001.
[30] P. Zhao and B. Yu. On model selection consistency of lasso. J. Mach. Learn. Res., 7:2541–2563, 2006.