{"title": "Tracking Time-varying Graphical Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 1205, "page_last": 1213, "abstract": "Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary.", "full_text": "Tracking Time-varying Graphical Structure\n\nErich Kummerfeld\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nekummerf@andrew.cmu.edu\n\nDavid Danks\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nddanks@andrew.cmu.edu\n\nAbstract\n\nStructure learning algorithms for graphical models have focused almost exclu-\nsively on stable environments in which the underlying generative process does not\nchange; that is, they assume that the generating model is globally stationary. In\nreal-world environments, however, such changes often occur without warning or\nsignal. Real-world data often come from generating models that are only locally\nIn this paper, we present LoSST, a novel, heuristic structure learn-\nstationary.\ning algorithm that tracks changes in graphical model structure or parameters in a\ndynamic, real-time manner. 
We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary.\n\n1 Introduction\n\nGraphical models are used in a wide variety of domains, both to provide compact representations of probability distributions for rapid, efficient inference, and also to represent complex causal structures. Almost all standard algorithms for learning graphical model structure [9, 10, 12, 3] assume that the underlying generating structure does not change over the course of data collection, and so the data are i.i.d. (or can be transformed into i.i.d. data). In the real world, however, generating structures often change, and it can be critical to quickly detect the structure change and then learn the new one.\n\nIn many of these real-world contexts, we also do not have the luxury of collecting large amounts of data and then retrospectively determining when (if ever) the structure changed. That is, we cannot learn in "batch mode," but must instead learn the novel structure in an online manner, processing the data as it arrives. Current online learning algorithms can detect and handle changes in the learning environment, but none are capable of general graphical model structure learning.\n\nIn this paper, we develop a heuristic algorithm that fills this gap: it assumes only that our data are locally i.i.d., and learns graphical model structure in an online fashion. In the next section, we quickly survey related methods and show that they are individually insufficient for this task. We then present the details of our algorithm and provide simulation evidence that it can successfully learn graphical model structure in an online manner. Importantly, when there is a stable generating structure, the algorithm's performance is indistinguishable from a standard batch-mode structure learning algorithm.
Thus, using this algorithm incurs no additional costs in "normal" structure learning situations.\n\n2 Related work\n\nWe focus here on graphical models based on directed acyclic graphs (DAGs) over random variables with corresponding quantitative components, whether Bayesian networks or recursive Structural Equation Models (SEMs) [3, 12, 10]. All of our observations in this paper, as well as the core algorithm, are readily adaptable to learn structure for models based on undirected graphs, such as Markov random fields or Gaussian graphical models [6, 9].\n\nStandard graphical model structure learning algorithms divide into two rough types. Bayesian/score-based methods aim to find the model $M$ that maximizes $P(M \mid Data)$, but in practice score the models using a decomposable measure based on $P(Data \mid M)$ and the number of parameters in $M$ [3]. Constraint-based structure learning algorithms leverage the fact that every graphical model predicts a pattern of (conditional) independencies over the variables, though multiple models can predict the same pattern. Those algorithms (e.g., [10, 12]) find the set of graphical models that best predict the (conditional) independencies in the data.\n\nBoth types of structure learning algorithms assume that the data come from a single generating structure, and so neither is directly usable for learning when structure change is possible. They learn from the sufficient statistics, but neither has any mechanism for detecting change, responding to it, or learning the new structure. Bayesian learning algorithms\u2014or various approximations to them\u2014are often used for online learning, but case-by-case Bayesian updating yields exactly the same output as batch-mode processing (assuming the data are i.i.d.).
Since we are focused on situations in which the underlying structure can change, we do not want the same output.\n\nOne could instead look to online learning methods that track some environmental feature. The classic TDL algorithm, TD(0) [13], provides a dynamic estimate $E_t(X)$ of a univariate random variable $X$ using a simple update rule: $E_{t+1}(X) \leftarrow E_t(X) + \alpha(X_t - E_t(X))$, where $X_t$ is the value of $X$ at time $t$. The static $\alpha$ parameter encodes the learning rate, and trades off convergence rate and robustness to noise (in stable environments). In general, TDL methods are good at tracking slow-moving environmental changes, but perform suboptimally during times of either high stability or dramatic change, such as when the generating model structure abruptly changes.\n\nBoth Bayesian [1] and frequentist [4] online changepoint detection (CPD) algorithms are effective at detecting abrupt changes, but do so by storing substantial portions of the input data. For example, a Bayesian CPD [1] outputs the probability of a changepoint having occurred $r$ timesteps ago, and so the algorithm must store more than $r$ datapoints. Furthermore, CPD algorithms assume a model of the environment that has only abrupt changes separated by periods of stability. Environments that evolve slowly but continuously will have their time-series discretized in seemingly arbitrary fashion, or not at all.\n\nTwo previous papers have aimed to learn time-indexed graph structures from time-series data, though both require full datasets as input, so cannot function in real-time [14, 11]. Talih and Hengartner (2005) take an ordered data set and divide it into a fixed number of (possibly empty) data intervals, each with an associated undirected graph that differs by one edge from its neighbors.
In contrast with our work, they focus on a particular type of graph structure change (single edge addition or deletion), operate solely in "batch mode," and use undirected graphs instead of directed acyclic graph models. Siracusa and Fisher III (2009) use a Bayesian approach to find the posterior uncertainty over the possible directed edges at different points in a time-series. Our approach differs by using frequentist methods instead of Bayesian ones (since we would otherwise need to maintain a probability distribution over the superexponential number of graphical models), and by being able to operate in real-time on an incoming data stream.\n\n3 Locally Stationary Structure Tracker (LoSST) Algorithm\n\nGiven a set of continuous variables $\mathbf{V}$, we assume that there is, at each time $r$, a true underlying generative model $G_r$ over $\mathbf{V}$. $G_r$ is assumed to be a recursive Structural Equation Model (SEM): a pair $\langle G, F \rangle$, where $G$ denotes a DAG over $\mathbf{V}$, and $F$ is a set of linear equations of the form $V_i = \sum_{V_j \in pa(V_i)} a_{ji} \cdot V_j + \epsilon_i$, where $pa(V_i)$ denotes the variables $V_j \in G$ such that $V_j \to V_i$, and the $\epsilon_i$ are normally distributed noise/error terms. In contrast to previous work on structure learning, we assume only that the generating process is locally stationary: for each time $r$, data are generated i.i.d. from $G_r$, but it is not necessarily the case that $G_r = G_s$ for $r \neq s$. Notice that $G_r$ can change in both structure (i.e., adding, removing, or reorienting edges) and parameters (i.e., changes in the $a_{ji}$'s or the $\epsilon_i$ distributions).\n\nAt a high level, the Locally Stationary Structure Tracker (LoSST) algorithm takes, at each timestep $r$, a new datapoint as input and outputs a graphical model $M_r$. Obviously, a single datapoint is insufficient to learn graphical model structure.
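The locally stationary setup just described can be illustrated with a toy generator. The sketch below is our own construction, not from the paper (all names are ours): datapoints are sampled i.i.d. from a linear-Gaussian SEM whose coefficients switch at a known changepoint.

```python
import random

def sample_sem(coefs, noise_sd=1.0):
    """coefs maps each variable index (in topological order) to a list of
    (parent_index, weight) pairs; returns one datapoint as a list."""
    x = [0.0] * len(coefs)
    for i in range(len(coefs)):
        x[i] = sum(w * x[j] for j, w in coefs[i]) + random.gauss(0.0, noise_sd)
    return x

def stream(n_points, changepoint, coefs_before, coefs_after):
    """Yield datapoints from coefs_before, then from coefs_after."""
    for r in range(n_points):
        yield sample_sem(coefs_before if r < changepoint else coefs_after)
```

For example, `stream(2000, 1000, g1, g2)` mimics the simulation design used later in the paper, where the generating SEM changes after the first 1000 datapoints.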
The LoSST algorithm instead tracks the locally stationary sufficient statistics\u2014for recursive SEMs, the means, covariances, and sample size\u2014in an online fashion, and then dynamically (re)learns the graphical model structure as appropriate. The LoSST algorithm processes each datapoint only once, and so LoSST can also function as a single-pass, graphical model structure learner for very large datasets.\n\nLet $\mathbf{X}^r$ be the $r$-th multivariate datapoint and let $X^r_i$ be the value of $V_i$ for that datapoint. To track the potentially changing generating structure, the datapoints must potentially be differentially weighted. In particular, datapoints should be weighted more heavily after a change occurs. Let $a_r \in (0, \infty)$ be the weight on $\mathbf{X}^r$, and let $b_r = \sum_{k=1}^r a_k$ be the sum of those weights over time.\n\nThe weighted mean of $V_i$ after datapoint $r$ is $\mu^r_i = \sum_{k=1}^r \frac{a_k}{b_r} X^k_i$, which can be computed in an online fashion using the update equation:\n\n$$\mu^{r+1}_i = \frac{b_r}{b_{r+1}} \mu^r_i + \frac{a_{r+1}}{b_{r+1}} X^{r+1}_i \quad (1)$$\n\nThe (weighted) covariance between $V_i$ and $V_j$ after datapoint $r$ is provably equal to $C^r_{V_i,V_j} = \sum_{k=1}^r \frac{a_k}{b_r} (X^k_i - \mu^r_i)(X^k_j - \mu^r_j)$. Let $\delta_i = \mu^{r+1}_i - \mu^r_i = \frac{a_{r+1}}{b_{r+1}} (X^{r+1}_i - \mu^r_i)$. The update equation for $C^{r+1}$ can be written (after some algebra) as:\n\n$$C^{r+1}_{V_i,V_j} = \frac{1}{b_{r+1}} \left[ b_r C^r_{V_i,V_j} + b_r \delta_i \delta_j + a_{r+1} (X^{r+1}_i - \mu^{r+1}_i)(X^{r+1}_j - \mu^{r+1}_j) \right] \quad (2)$$\n\nIf $a_k = c$ for all $k$ and some constant $c > 0$, then the estimated covariance matrix is identical to the batch-mode estimated covariance matrix.
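Update equations (1) and (2) translate directly into an online procedure. The sketch below (class and method names are our own) maintains the weighted mean and covariance from only the previous values and the running weight sum $b_r$; with constant weights it reproduces the batch (weighted) estimates, as the text claims.

```python
class WeightedStats:
    """Online tracking of the weighted mean and covariance, following
    update equations (1) and (2) in the text."""

    def __init__(self, dim):
        self.mu = [0.0] * dim
        self.cov = [[0.0] * dim for _ in range(dim)]
        self.b = 0.0  # running sum of weights, b_r

    def update(self, x, a):
        """Incorporate datapoint x with weight a."""
        b_new = self.b + a
        # Equation (1): convex combination of old mean and new point.
        mu_new = [(self.b / b_new) * m + (a / b_new) * xi
                  for m, xi in zip(self.mu, x)]
        delta = [mn - m for mn, m in zip(mu_new, self.mu)]
        # Equation (2): rescale old covariance, correct for the mean
        # shift, and add the new point's (weighted) contribution.
        for i in range(len(x)):
            for j in range(len(x)):
                self.cov[i][j] = (self.b * self.cov[i][j]
                                  + self.b * delta[i] * delta[j]
                                  + a * (x[i] - mu_new[i]) * (x[j] - mu_new[j])) / b_new
        self.mu, self.b = mu_new, b_new
```

With `a = 1` on every call, `cov` equals the batch covariance computed with denominator $b_r$ (i.e., the population-style estimate in the text's definition of $C^r$).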
If $a_r = \alpha b_r$, then the learning is the same as if one uses TD(0) learning for each covariance with a learning rate of $\alpha$.\n\nThe sample size $S_r$ is more complicated, since datapoints are weighted differently and so the "effective" sample size can differ from the actual sample size (though it should always be less-than-or-equal). Because $\mathbf{X}^{r+1}$ comes from the current generating structure, it should always contribute 1 to the effective sample size. In addition, $\mathbf{X}^{r+1}$ is weighted $\frac{a_{r+1}}{a_r}$ more than $\mathbf{X}^r$. If we adjust the natural sample size update equation to satisfy these two constraints, then the update equation becomes:\n\n$$S_{r+1} = \frac{a_r}{a_{r+1}} S_r + 1 \quad (3)$$\n\nIf $a_{r+1} \geq a_r$ for all $r$ (as in the method we use below), then $S_{r+1} \leq S_r + 1$. If $a_{r+1} = a_r$ for all $r$, then $S_r = r$; that is, if the datapoint weights are constant, then $S_r$ is the true sample size.\n\nSufficient statistics tracking\u2014$\mu^{r+1}$, $C^{r+1}$, and $S_{r+1}$\u2014thus requires remembering only their previous values and $b_r$, assuming that $a_{r+1}$ can be efficiently computed. The $a_{r+1}$ weights are based on the "fit" between the current estimated covariance matrix and the input data: poor fit implies that a change in the underlying generating structure is more likely. For multivariate Gaussian data, the "fit" between $\mathbf{X}^{r+1}$ and the current estimated covariance matrix $C^r$ is given by the Mahalanobis distance $D_{r+1}$ [8]: $D_{r+1} = (\mathbf{X}^{r+1} - \mu^r)(C^r)^{-1}(\mathbf{X}^{r+1} - \mu^r)^T$.\n\nA large Mahalanobis distance (i.e., poor fit) for some datapoint could indicate simply an outlier; inferring that the underlying generating structure has changed requires large Mahalanobis distances over multiple datapoints. The likelihood of the (weighted) sequence of $D_r$'s is analytically intractable, and so we cannot use the $D_r$ values directly.
We instead base the $a_{r+1}$ weights on the (weighted) pooled p-value of the individual p-values for the Mahalanobis distance of each datapoint. The Mahalanobis distance of a $V$-dimensional datapoint from a covariance matrix estimated from a sample of size $N$ is distributed as Hotelling's $T^2$ with parameters $p = V$ and $m = N - 1$. The p-value for the Mahalanobis distance $D_{r+1}$ is thus: $p_{r+1} = T^2(x > D_{r+1} \mid p = V, m = S_r - 1)$, where $S_r$ is the effective sample size. Let $\Phi(x, y)$ be the cdf of a Gaussian with mean 0 and variance $y$, evaluated at $x$. Then Liptak's method for weighted pooling of the individual p-values [7] gives the following definition:1 $\rho_{r+1} = \Phi\left(\sum_{i=1}^r a_i \Phi^{-1}(p_i, 1), \sqrt{\sum a_i^2}\right) = \Phi(\eta_{r+1}, \sqrt{\tau_{r+1}})$, where the update equations for $\eta$ and $\tau$ are $\eta_{r+1} = \eta_r + a_r \Phi^{-1}(p_r, 1)$ and $\tau_{r+1} = \tau_r + a_r^2$.\n\n1 $\rho_{r+1}$ cannot include $p_{r+1}$ without being circular: $p_{r+1}$ would have to be appropriately weighted by $a_{r+1}$, but that weight depends on $\rho_{r+1}$.\n\nThere are many ways to convert the pooled p-value $\rho_{r+1}$ into a weight $a_{r+1}$. We use the strategy: if $\rho_{r+1}$ is greater than some threshold $T$ (i.e., the data sequence is sufficiently likely given the current model), then keep the weight constant; if $\rho_{r+1}$ is less than $T$, then increase $a_{r+1}$ linearly and inversely to $\rho_{r+1}$, up to a maximum of $\gamma a_r$ at $\rho_{r+1} = 0$. Mathematically, this transformation is:\n\n$$a_{r+1} = a_r \cdot \max\left\{1, \frac{\gamma T - \gamma \rho_{r+1} + \rho_{r+1}}{T}\right\} \quad (4)$$\n\nEfficient computation of $a_{r+1}$ thus only requires additionally tracking $\rho_r$, $\eta_r$, and $\tau_r$.\n\nWe can efficiently track the relevant sufficient statistics in an online fashion, and so the only remaining step is to learn the corresponding graphical model.
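The weighting machinery above can be sketched as follows. This is our own illustrative transcription (function names are ours): `next_weight` implements equation (4), and `pool_update` maintains the running Liptak quantities $\eta$ and $\tau$; we take the pooled p-value as the zero-mean Gaussian cdf with variance $\tau$ evaluated at $\eta$, which matches the standard Liptak statistic. The standard-normal inverse cdf is approximated by bisection, and the Hotelling's $T^2$ p-value computation is omitted.

```python
import math

def phi_cdf(x, var):
    """Cdf of a zero-mean Gaussian with variance var, evaluated at x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0 * var)))

def phi_inv(p):
    """Bisection inverse of the standard-normal cdf (illustrative only)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if phi_cdf(mid, 1.0) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pool_update(eta, tau, p_r, a_r):
    """Running Liptak quantities: eta accumulates a_r * Phi^-1(p_r),
    tau accumulates a_r^2; the pooled p-value is phi_cdf(eta, tau)."""
    return eta + a_r * phi_inv(p_r), tau + a_r * a_r

def next_weight(a_r, rho, T, gamma):
    """Equation (4): keep the weight constant when rho >= T, otherwise
    increase it linearly up to gamma * a_r as rho -> 0."""
    return a_r * max(1.0, (gamma * T - gamma * rho + rho) / T)
```

Note that `next_weight` returns exactly `a_r` at `rho = T` and `gamma * a_r` at `rho = 0`, matching the piecewise-linear rule in the text.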
The implementation in this paper uses the PC algorithm [12], a standard constraint-based structure learning algorithm. A range of alternative structure learning algorithms could be used instead, depending on the assumptions one is able to make.\n\nLearning graphical model structure is computationally expensive [2], and so one should balance the accuracy of the current model against the computational cost of relearning. More precisely, graph2 relearning should be most frequent after an inferred underlying change, though there should be a non-zero chance of relearning even when the structure appears to be relatively stable (since the structure could be slowly drifting).\n\nIn practice, the LoSST algorithm probabilistically relearns based on the inverse3 of $\rho_r$: the probability of relearning at time $r + 1$ is a noisy-OR gate of the probability of relearning at time $r$ and a weighted $(1 - \rho_{r+1})$. Mathematically, $P_{r+1}(relearn) = P_r(relearn) + \nu(1 - \rho_{r+1}) - P_r(relearn)\,\nu(1 - \rho_{r+1})$, where $\nu \in [0, 1]$ modifies the frequency of graph relearning: large values result in more frequent relearning and small values result in fewer. If a relearning event is triggered at datapoint $r$, then a new graphical model structure and parameters are learned, and $P_r(relearn)$ is set to 0. In general, $\rho_r$ is lower when changepoints are detected, so $P_r(relearn)$ will increase more quickly around changepoints, and graph relearning will become more frequent. During times of stability, $\rho_r$ will be comparatively large, resulting in a slower increase of $P_r(relearn)$ and thus less frequent graph relearning.\n\n3.1 Convergence vs. diligence in LoSST\n\nLoSST is capable of exhibiting different long-run properties, depending on its parameters. Convergence is a standard desideratum: if there is a stable structure in the limit, then the algorithm's output should stabilize on that structure.
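Before examining these long-run properties, note that the probabilistic relearning trigger just described is a two-line computation. The sketch below is our own transcription (function names are ours) of the noisy-OR update and the reset-on-relearn behavior from the text.

```python
import random

def step_relearn_prob(p_prev, rho_next, nu):
    """Noisy-OR update: P_{r+1} = P_r + nu*(1-rho) - P_r * nu*(1-rho)."""
    inc = nu * (1.0 - rho_next)
    return p_prev + inc - p_prev * inc

def maybe_relearn(p_prev, rho_next, nu, rng=random.random):
    """Return (relearn?, new probability); the probability resets to 0
    whenever a relearning event fires, per the text."""
    p = step_relearn_prob(p_prev, rho_next, nu)
    if rng() < p:
        return True, 0.0
    return False, p
```

When $\rho$ stays near 1 the probability barely grows, while a run of low pooled p-values compounds it quickly, which is exactly the qualitative behavior described above.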
In contexts in which the true structure can change, another desirable property for learning algorithms is diligence: if the generating structure has a change of given size (that manifests in the data), then the algorithm should detect and respond to that change within a fixed number of datapoints (regardless of the amount of previous data). Both diligence and convergence are desirable methodological virtues, but they are provably incompatible: no learning algorithm can be both diligent and convergent [5]. Intuitively, they are incompatible because they must respond differently to improbable datapoints: convergent algorithms must tolerate them (since such data occur with probability 1 in the infinite limit), while diligent algorithms must regard them as signals that the structure has changed.\n\nIf $\gamma = 1$, then LoSST is a convergent algorithm, since it follows that $a_{r+1} = a_r$ for all $r$ (which is a sufficient condition for convergence). For $\gamma > 1$, the behavior of LoSST depends on $T$. If $T < 0$, then we again have $a_{r+1} = a_r$ for all $r$, and so LoSST is convergent. LoSST is also provably convergent if $T$ is time-indexed such that $T_r = f(S_r)$ for some $f$ with $(0, 1]$ range, where $\sum_{i=1}^{\infty} (1 - f(i))$ converges.4\n\n2 Recall that the sufficient statistics are updated after every datapoint.\n3 Recall that $\rho_r$ is a pooled p-value, so low values indicate unlikely data.\n4 Proof sketch: $\sum_{i=r}^{\infty} (1 - q_i)$ can be shown to be an upper bound on the probability that $(1 - \rho_i) > q_i$ will occur for some $i$ in $[r, \infty)$, where $q_i$ is the $i$-th element of the sequence $Q$ of lower threshold values. Any sequence $Q$ s.t. $\sum_{i=1}^{\infty} (1 - q_i) < 1$ will then guarantee that an infinite amount of unbiased data will be accumulated in the infinite limit. This provides probability 1 convergence for LoSST, since the structure learning method has probability 1 convergence in the limit.
If $Q$ is prepended with arbitrary strictly positive threshold values, the first element of $Q$ will still be reached infinitely many times with probability 1 in the infinite limit, and so LoSST will still converge with probability 1, even using these expanded sequences.\n\nIn contrast, if $T > 1$ and $\gamma > 1$, then LoSST is provably diligent.5 We conjecture that there are sequences of time-indexed $T_r < 1$ that will also yield diligent versions of LoSST, analogously to the condition given above for convergence.\n\nInterestingly, if $\gamma > 1$ and $0 < T < 1$, then LoSST is neither convergent nor diligent, but rather strikes a balance between the desiderata. In particular, these versions (a) tend to converge towards stable structures, but provably do not actually converge, since they remain sensitive to outliers; and (b) respond quickly to changes in the generating structure, but only exponentially fast in the number of previous datapoints, rather than within a fixed interval. The full behavior of LoSST in this parameter regime, including the extent and sensitivity of trade-offs, is an open question for future research. For the simulations below, unsystematic investigation led to $T = 0.05$ and $\gamma = 3$, which seemed to appropriately trade off convergence vs. diligence in that context.\n\n4 Simulation results\n\nWe used synthetic data to evaluate the performance of LoSST given known ground truth. All simulations used scenarios in which either the ground truth parameters or ground truth graph (and parameters) changed during the course of data collection. Before the first changepoint, there should be no significant difference between LoSST and a standard batch-mode learner, since those datapoints are globally i.i.d. Performance on these datapoints thus provides information about the performance cost (if any) of online learning using LoSST, relative to traditional algorithms.
After a changepoint, one is interested both in the absolute performance of LoSST (i.e., can it track the changes?) and in its performance relative to a standard batch-mode algorithm (i.e., what performance gain does it provide?). We used the PC algorithm [12] as our baseline batch-mode learning algorithm; we conjecture that any other standard graphical model structure learning algorithm would perform similarly, given the graphs and sample sizes in our simulations.\n\nIn order to directly compare the performance of LoSST and PC, we imposed a fixed "graph relearning" schedule6 on LoSST. The first set of simulations used datasets with 2000 datapoints, where the SEM graph and parameters both changed after the first 1000 datapoints. 500 datasets were generated for each of a range of $\langle \#variables, MaxDegree \rangle$ pairs,7 where each dataset used two different, randomly generated SEMs of the specified size and degree.\n\nFigures 1(a-c) show the mean edge addition, removal, and orientation errors (respectively) by LoSST as a function of time, and Figures 1(d-f) show the means of $\#errors_{PC} - \#errors_{LoSST}$ for each error type (i.e., higher numbers imply LoSST outperforms PC). In all Figures, each $\langle variable, degree \rangle$ pair is a distinct line. As expected, LoSST was basically indistinguishable from PC for the first 1000 datapoints; the lines in Figures 1(d-f) for that interval are all essentially zero. After the underlying generating model changes, however, there are significant differences. The PC algorithm performs quite poorly because the full dataset is essentially a mixture from two different distributions, which induces a large number of spurious associations. In contrast, the LoSST algorithm finds large Mahalanobis distances for those datapoints, which lead to higher weights, which lead it to learn (approximately) the new underlying graphical model.
In practice, LoSST typically\nstabilized on a new model by roughly 250 datapoints after the changepoint.\nThe second set of simulations was identical to the \ufb01rst (500 runs each for various pairs of variable\nnumber and edge degree), except that the graph was held constant throughout and only the SEM\nparameters changed after 1000 datapoints. Figures 2(a-c) and 2(d-f) report, for these simulations, the\nsame measures as Figures 1(a-c) and 1(d-f). Again, LoSST and PC performed basically identically\nfor the \ufb01rst 1000 datapoints. Performance after the parameter change did not follow quite the same\npattern as before, however. LoSST again does much better on edge addition and orientation errors,\nbut performed signi\ufb01cantly worse on edge removal errors for the \ufb01rst 200 points following the\n\n5Proof sketch: By equation (4), T > 1 & \u03b3 > 1 \u21d2 \u03b3 \u2212 \u03b3\u22121\n\nT ) > ar for all\nr. This last strict inequality implies that the effective sample size has a \ufb01nite upper bound (= \u03b3T\u2212\u03b3+1\n(\u03b3\u22121)(T\u22121) if\n\u03c1r = 1 for all r), and the majority of the effective sample comes from recent data points. These two conditions\nare jointly suf\ufb01cient for diligence.\n6LoSST relearned graphs and PC was rerun after datapoints {25, 50, 100, 200, 300, 500, 750, 1000, 1025,\n1050, 1100, 1200, 1300, 1500, 1750, 2000}.\n7Speci\ufb01cally, (cid:104)4, 3(cid:105), (cid:104)8, 3(cid:105), (cid:104)10, 3(cid:105), (cid:104)10, 7(cid:105), (cid:104)15, 4(cid:105), (cid:104)15, 9(cid:105), (cid:104)20, 5(cid:105), and (cid:104)20, 12(cid:105)\n\nT > 1 \u21d2 ar+1 \u2265 ar(\u03b3 \u2212 \u03b3\u22121\n\n5\n\n\f(a)\n\n(d)\n\n(b)\n\n(e)\n\n(c)\n\n(f)\n\nFigure 1: Structure & parameter changes: (a-c) LoSST errors; (d-f) LoSST improvement over PC\n\n(a)\n\n(d)\n\n(b)\n\n(e)\n\n(c)\n\n(f)\n\nFigure 2: Parameter changes: (a-c) LoSST errors; (d-f) LoSST improvement over PC\n\nchange. 
When a change occurs, PC initially responds by adding edges to the output, while LoSST responds by being more cautious in its inferences (since the effective sample size shrinks after a change). The short-term impact on each algorithm is thus: PC's output tends to be a superset of the original edges, while LoSST outputs fewer edges. As a result, PC can outperform LoSST for a brief time on the edge removal metric in these types of cases, in which the change involves only parameters, not graph structure.\n\nThe third set of simulations was designed to explore in detail the performance with probabilistic relearning. We randomly generated a single dataset with 10,000 datapoints, where the underlying SEM graph and parameters changed after every 1000 datapoints. Each SEM had 10 variables and maximum degree of 7. We then ran LoSST with probabilistic relearning ($\nu = .005$) 500 times on this dataset. Figure 3(a) shows the (observed) expected number of "relearnings" in each 25-datapoint window. As expected, there are substantial relearning peaks after each structure shift, and the expected number of relearnings persisted at roughly 0.1 per 25 datapoints throughout the stable periods. Figures 3(b-d) provide error information: the smooth green lines indicate the mean edge addition, removal, and orientation errors (respectively) during learning, and the blocky blue lines indicate the LoSST errors if graph relearning occurred after every datapoint.\n\nFigure 3: (a) LoSST expected relearnings; (b-d) Expected edge additions, removals, and flips, against constant relearning\n\nFigure 4: (a) Effective sample size during LoSST run on BLS data; (b) Pooled p-values; (c) Mahalanobis distances\n\n
Although there are many fewer graph relearnings with the probabilistic schedule, overall errors did not significantly increase.\n\n5 Application to US price index volatility\n\nTo test the performance of the LoSST algorithm on real-world data, we applied it to seasonally adjusted price index data from the U.S. Bureau of Labor Statistics. We limited the data to commodities/services with data going back to at least 1967, resulting in a data set of 6 variables: Apparel, Food, Housing, Medical, Other, and Transportation. The data were collected monthly from 1967-2011, resulting in 529 data points. Because of significant trends in the indices over time, we used month-to-month differences.\n\nFigure 4(a) shows the change in effective sample size, where the key observation is that change detection prompts significant drops in the effective sample size. Figures 4(b) and 4(c) show the pooled p-value and Mahalanobis distance for each month, which are the drivers of sample size changes. The Great Moderation was a well-known macroeconomic phenomenon between 1980 and 2007 in which the U.S. financial market underwent a slow but steady reduction in volatility. LoSST appears to detect exactly such a shift in the volatility of the relationships between these price indexes, though it detects another shift shortly after 2000.8 This real-world case study also demonstrates the importance of using pooled p-values, as that is why LoSST does not respond to the single-month spike in Mahalanobis distance in 1995, but does respond to the extended sequence of slightly-above-average Mahalanobis distances around 1980.\n\n6 Discussion and future research\n\nThe LoSST algorithm is suitable for locally stationary structures, but there are obviously limits. In particular, it will perform poorly if the generating structure changes very rapidly, or if the datapoints are a random-order mixture from multiple structures.
An important future research direction is to characterize and then improve LoSST's performance on more rapidly varying structures. Various heuristic aspects of LoSST could also potentially be replaced by more normative procedures, though as noted earlier, many will not work without substantial revision (e.g., obvious Bayesian methods).\n\nThis algorithm can also be extended to have the current learned model influence the $a_r$ weights. Suppose particular graphical edges or adjacencies have not changed over a long period of time, or have been stable over multiple relearnings. In that case, one might plausibly conclude that those connections are less likely to change, and so much greater error should be required to relearn those connections. In practice, this extension would require the $a_r$ weights to vary across $\langle V_i, V_j \rangle$ pairs, which significantly complicates the mathematics and memory requirements of the sufficient statistic tracking. It is an open question whether the (presumably) improved tracking would compensate for the additional computational and memory cost in particular domains.\n\nWe have focused on SEMs, but there are many other types of graphical models; for example, Bayesian networks have the same graph-type but are defined over discrete variables with conditional probability tables. Tracking the sufficient statistics for Bayes net structure learning is substantially more costly, and we are currently investigating ways to learn the necessary information in a tractable, online fashion. Similarly, our graph learning relies on constraint-based structure learning, since the relevant scores in score-based methods (such as [3]) do not decompose in a manner that is suitable for online learning.
We are thus investigating alternative scores, as well as heuristic approximations to principled score-based search.\n\nThere are many real-world contexts in which batch-mode structure learning is either infeasible or inappropriate. In particular, the real world frequently involves dynamically varying structures that our algorithms must track over time. The online structure learning algorithm presented here has great potential to perform well in a range of challenging contexts, and at little cost in "traditional" settings.\n\nAcknowledgments\n\nThanks to Joe Ramsey and Rob Tillman for help with the simulations, and three anonymous reviewers for helpful comments. DD was partially supported by a James S. McDonnell Foundation Scholar Award.\n\n8 This shift is almost certainly due to the U.S. recession that occurred in March to November of that year.\n\nReferences\n\n[1] R. P. Adams and D. J. C. MacKay. Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK, 2007. arXiv:0710.3742v1 [stat.ML].\n[2] D. M. Chickering. Learning Bayesian networks is NP-complete. In Proceedings of AI and Statistics, 1995.\n[3] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507-554, 2002.\n[4] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 8:2961-2974, 2005.\n[5] E. Kummerfeld and D. Danks. Model change and methodological virtues in scientific inference. Technical report, Carnegie Mellon University, Pittsburgh, Pennsylvania, 2013.\n[6] S. L. Lauritzen. Graphical models. Clarendon Press, 1996.\n[7] T. Liptak. On the combination of independent tests. Magyar Tud. Akad. Mat. Kutato Int. Kozl., 3:171-197, 1958.\n[8] P. C. Mahalanobis. On the generalized distance in statistics.
Proceedings of the National Institute of Sciences of India, 2:49-55, 1936.\n[9] A. McCallum, D. Freitag, and F. C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML-2000, pages 591-598, 2000.\n[10] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.\n[11] M. R. Siracusa and J. W. Fisher III. Tractable Bayesian inference of time-series dependence structure. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2009.\n[12] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.\n[13] R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.\n[14] M. Talih and N. Hengartner. Structural learning with time-varying components: tracking the cross-section of financial time series. Journal of the Royal Statistical Society - Series B: Statistical Methodology, 67(3):321-341, 2005.\n", "award": [], "sourceid": 625, "authors": [{"given_name": "Erich", "family_name": "Kummerfeld", "institution": "CMU"}, {"given_name": "David", "family_name": "Danks", "institution": "CMU"}]}