{"title": "Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors", "book": "Advances in Neural Information Processing Systems", "page_first": 397, "page_last": 405, "abstract": "We introduce a new Bayesian nonparametric approach to identification of sparse dynamic linear systems. The impulse responses are modeled as Gaussian processes whose autocovariances encode the BIBO stability constraint, as defined by the recently introduced \u201cStable Spline kernel\u201d. Sparse solutions are obtained by placing exponential hyperpriors on the scale factors of such kernels. Numerical experiments regarding estimation of ARMAX models show that this technique provides a definite advantage over a group LAR algorithm and state-of-the-art parametric identification techniques based on prediction error minimization.", "full_text": "Learning sparse dynamic linear systems using\n\nstable spline kernels and exponential hyperpriors\n\nAlessandro Chiuso\n\nUniversity of Padova\n\nVicenza, Italy\n\nGianluigi Pillonetto\u2217\n\nUniversity of Padova\n\nPadova, Italy\n\nDepartment of Management and Engineering\n\nDepartment of Information Engineering\n\nalessandro.chiuso@unipd.it\n\ngiapi@dei.unipd.it\n\nAbstract\n\nWe introduce a new Bayesian nonparametric approach to identi\ufb01cation of sparse\ndynamic linear systems. The impulse responses are modeled as Gaussian pro-\ncesses whose autocovariances encode the BIBO stability constraint, as de\ufb01ned by\nthe recently introduced \u201cStable Spline kernel\u201d. Sparse solutions are obtained by\nplacing exponential hyperpriors on the scale factors of such kernels. 
Numerical experiments regarding estimation of ARMAX models show that this technique provides a definite advantage over a group LAR algorithm and state-of-the-art parametric identification techniques based on prediction error minimization.\n\n1 Introduction\n\nBlack-box identification approaches are widely used to learn dynamic models from a finite set of input/output data [1]. In particular, in this paper we focus on the identification of large scale linear systems that involve a large number of variables and find important applications in many different domains such as chemical engineering, economic systems and computer vision [2]. In this scenario, a key point is that the identification procedure should be sparsity-favouring, i.e. able to extract, from the large number of subsystems entering the system description, just the subset which significantly influences the system output. This sparsity principle permeates many well known techniques in machine learning and signal processing, such as feature selection, selective shrinkage and compressed sensing [3, 4].\n\nIn the classical identification scenario, Prediction Error Methods (PEM) represent the most widely used approaches to optimal prediction of discrete-time systems [1]. The statistical properties of PEM (and Maximum Likelihood) methods are well understood when the model structure is assumed to be known. However, in real applications, first a set of competitive parametric models has to be postulated. Then, a key point is the selection of the most adequate model structure, usually performed by AIC and BIC criteria [5, 6]. Not surprisingly, the resulting prediction performance, when tested on experimental data, may be distant from that predicted by \u201cstandard\u201d (i.e. without model selection) statistical theory, which suggests that PEM should be asymptotically efficient for Gaussian innovations. 
While this drawback already affects standard identification problems, it a fortiori complicates the study of large scale systems, where the large number of parameters, as compared to the number of available data, may undermine the applicability of the theory underlying e.g. AIC and BIC.\n\nSome novel estimation techniques inducing sparse models have recently been proposed. They include the well known Lasso [7] and Least Angle Regression (LAR) [8], where variable selection is performed exploiting the \u21131 norm. This type of penalty term encodes the so-called bi-separation feature, i.e. it favors solutions with many zero entries at the expense of a few large components. Consistency properties of this method are discussed e.g. in [9, 10]. Extensions of this procedure for group selection include Group Lasso and Group LAR (GLAR) [11], where the sum of the Euclidean norms of the groups (in place of the absolute values of the single components) is used. Theoretical analyses of these approaches and connections with the multiple kernel learning problem can be found in [12, 13]. However, most of this work has been done in the \u201cstatic\u201d scenario, while very little, with some exceptions [14, 15], can be found regarding the identification of dynamic systems.\n\n\u2217This research has been partially supported by the PRIN Project \u201cSviluppo di nuovi metodi e algoritmi per l\u2019identificazione, la stima Bayesiana e il controllo adattativo e distribuito\u201d, by the Progetto di Ateneo CPDA090135/09 funded by the University of Padova and by the European Community\u2019s Seventh Framework Programme under agreement n. FP7-ICT-223866-FeedNetBack.\n\nIn this paper we adopt a Bayesian point of view to prediction and identification of sparse linear systems. 
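To make the group penalty just recalled concrete: minimizing a least-squares term plus the sum of groupwise Euclidean norms leads, in its proximal step, to block soft-thresholding, which either shrinks a whole group or sets it exactly to zero. The sketch below only illustrates this mechanism under a hypothetical threshold; it is not the GLAR algorithm itself.

```python
import numpy as np

def group_soft_threshold(groups, tau):
    """Proximal map of tau * sum_g ||x_g||_2: each group is shrunk
    toward zero by tau in Euclidean norm, and groups whose norm is
    at most tau are zeroed entirely (groupwise sparsity)."""
    out = []
    for x in groups:
        n = np.linalg.norm(x)
        out.append(np.zeros_like(x) if n <= tau else (1.0 - tau / n) * x)
    return out

# A weak group is removed entirely; a strong one is only shrunk.
shrunk = group_soft_threshold([np.array([0.1, -0.1]), np.array([3.0, 4.0])], tau=1.0)
```

Applied to the coefficient groups associated with each candidate input, this is how an \u21131-type group penalty can discard entire subsystems at once.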
Our starting point is the new identification paradigm developed in [16], which relies on nonparametric estimation of impulse responses (see also [17] for extensions to predictor estimation). Rather than postulating finite-dimensional structures for the system transfer function, e.g. ARX, ARMAX or Laguerre [1], the system impulse response is searched for within an infinite-dimensional space. The intrinsically ill-posed nature of the problem is circumvented using Bayesian regularization methods. In particular, working under the framework of Gaussian regression [18], in [16] the system impulse response is modeled as a Gaussian process whose autocovariance is the so-called stable spline kernel, which includes the BIBO stability constraint.\n\nIn this paper, we extend this nonparametric paradigm to the design of optimal linear predictors for sparse systems. Without loss of generality, the analysis is restricted to MISO systems, so that we interpret the predictor as a system with m + 1 inputs (given by past outputs and inputs) and one output (output predictions). Thus, predictor design amounts to estimating m + 1 impulse responses modeled as realizations of Gaussian processes. We set their autocovariances to stable spline kernels with different (and unknown) scale factors, which are assigned exponential hyperpriors having a common hypervariance. In this way, while GLAR uses the sum of the \u21132 norms of the single impulse responses, our approach favors sparsity through an \u21131 penalty on kernel hyperparameters. Inducing sparsity by hyperpriors is an important feature of our approach. In fact, it permits obtaining the marginal posterior of the hyperparameters in closed form and hence also their estimates in a robust way. Once the kernels are selected, the impulse responses are obtained by a convex Tikhonov-type variational problem. 
Numerical experiments involving sparse ARMAX systems show that this approach provides a definite advantage over both GLAR and PEM (equipped with AIC or BIC) in terms of predictive capability on new output data.\n\nThe paper is organized as follows. In Section 2, the nonparametric approach to system identification introduced in [16] is briefly reviewed. Section 3 reports the statement of the predictor estimation problem, while Section 4 describes the new Bayesian model for system identification of sparse linear systems. In Section 5, a numerical algorithm which returns the unknown components of the prior and the estimates of predictor and system impulse responses is derived. In Section 6 we use simulated data to demonstrate the effectiveness of the proposed approach. Conclusions end the paper.\n\n2 Preliminaries: kernels for system identification\n\n2.1 Kernel-based regularization\n\nA widely used approach to reconstructing a function from indirect measurements {yt} consists of minimizing a regularization functional in a reproducing kernel Hilbert space (RKHS) H associated with a symmetric and positive-definite kernel K [19]. Given N data points, least-squares regularization in H estimates the unknown function as\n\n\u02c6h = arg min_h \u03a3_{t=1}^{N} (yt \u2212 \u0393t[h])^2 + \u03b7 \u2016h\u2016^2_H   (1)\n\nwhere {\u0393t} are linear and bounded functionals on H related to the measurement model, while the positive scalar \u03b7 trades off empirical error and solution smoothness [20]. Under the stated assumptions and according to the representer theorem [21], the minimizer of (1) is the sum of N basis functions defined by the kernel filtered by the operators {\u0393t}, with coefficients obtainable by solving a linear system of equations. This solution also enjoys an interpretation in Bayesian terms. 
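For the special case of direct noisy samples, i.e. \u0393t[h] = h(t), the representer-theorem solution of (1) reduces to a single linear solve. The following sketch illustrates this; the Gaussian kernel, grid and test signal are illustrative choices only (the paper itself works with the stable spline kernel introduced below).

```python
import numpy as np

def kernel_rls(t, y, kernel, eta):
    """Regularized least squares in the RKHS of `kernel`:
    min_h sum_i (y_i - h(t_i))^2 + eta * ||h||_H^2.
    By the representer theorem, h(s) = sum_i c_i k(s, t_i) with
    c solving (K + eta * I) c = y, where K_ij = k(t_i, t_j)."""
    K = kernel(t[:, None], t[None, :])
    c = np.linalg.solve(K + eta * np.eye(len(t)), y)
    return lambda s: kernel(np.atleast_1d(s)[:, None], t[None, :]) @ c

# Illustrative data (not from the paper): a noisy sine on [0, 1].
rbf = lambda a, b: np.exp(-0.5 * ((a - b) / 0.1) ** 2)
t = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * t) + 0.05 * np.random.default_rng(0).standard_normal(20)
h_hat = kernel_rls(t, y, rbf, eta=1e-2)
```

The returned closure evaluates the estimated function anywhere, with smoothness controlled by the kernel and by \u03b7.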
The estimate (1) corresponds to the minimum variance estimate of f when f is a zero-mean Gaussian process with autocovariance K and {yt \u2212 \u0393t[f]} is white Gaussian noise independent of f [22]. Often, prior knowledge is limited to the fact that the signal, and possibly some of its derivatives, are continuous with bounded energy. In this case, f is often modeled as the p-fold integral of white noise. If the white noise has unit intensity, the autocovariance of f is Wp, where\n\nWp(s, t) = \u222b_0^1 Gp(s, u) Gp(t, u) du,   Gp(r, u) = (r \u2212 u)^{p\u22121}_+ / (p \u2212 1)!,   (u)_+ = u if u \u2265 0 and 0 otherwise   (2)\n\nThis is the autocovariance associated with the Bayesian interpretation of p-th order smoothing splines [23]. In particular, when p = 2, one obtains the cubic spline kernel.\n\nFigure 1: Realizations of a stochastic process f with autocovariance proportional to the standard Cubic Spline kernel (left), the new Stable Spline kernel (middle) and its sampled version enriched by a parametric component defined by the poles \u22120.5 \u00b1 0.6\u221a\u22121 (right).\n\n2.2 Kernels for system identification\n\nIn the system identification scenario, the main drawback of the kernel (2) is that it does not account for impulse response stability. In fact, the variance of f increases over time. This can be easily appreciated by looking at Fig. 1 (left), which displays 100 realizations drawn from a zero-mean Gaussian process with autocovariance proportional to W2. One of the key contributions of [16] is the definition of a kernel specifically suited to linear system identification, leading to an estimator with favorable bias and variance properties. 
In particular, it is easy to see that if the autocovariance of f is proportional to Wp, the variance of f(t) is zero at t = 0 and tends to \u221e as t increases. However, if f represents a stable impulse response, we would rather let it have a finite variance at t = 0 which goes exponentially to zero as t tends to \u221e. This property can be ensured by considering autocovariances proportional to the class of kernels given by\n\nKp(s, t) = Wp(e^{\u2212\u03b2s}, e^{\u2212\u03b2t}),   s, t \u2208 R+   (3)\n\nwhere \u03b2 is a positive scalar governing the decay rate of the variance [16]. In practice, \u03b2 will be unknown, so that it is convenient to treat it as a hyperparameter to be estimated from data. In view of (3), if p = 2 the autocovariance becomes the Stable Spline kernel introduced in [16]:\n\nK2(t, \u03c4) = e^{\u2212\u03b2(t+\u03c4)} e^{\u2212\u03b2 max(t,\u03c4)} / 2 \u2212 e^{\u22123\u03b2 max(t,\u03c4)} / 6   (4)\n\nProposition 1 [16] Let f be zero-mean Gaussian with autocovariance K2. Then, with probability one, the realizations of f are continuous impulse responses of BIBO stable dynamic systems.\n\nThe effect of the stability constraint is visible in Fig. 1 (middle), which displays 100 realizations drawn from a zero-mean Gaussian process with autocovariance proportional to K2 with \u03b2 = 0.4.\n\n3 Statement of the system identification problem\n\nIn what follows, vectors are column vectors, unless otherwise specified. We denote with {yt}t\u2208Z, yt \u2208 R, and {ut}t\u2208Z, ut \u2208 Rm, a pair of jointly stationary stochastic processes which represent, respectively, the output and input of an unknown time-invariant dynamical system.\n\nFigure 2: Bayesian network describing the new nonparametric model for identification of sparse linear systems where yl := [yl\u22121, yl\u22122, . . .] and, in the reduced model, \u03bb := \u03bb1 = . . . = \u03bbm+1.\n\n
With some abuse of notation, yt will denote both a random variable (from the random process {yt}t\u2208Z) and its sample value. The same holds for ut. Our aim is to identify, from {ut, yt}t=1,..,N, a linear dynamical system of the form\n\nyt = \u03a3_{i=1}^{\u221e} fi u_{t\u2212i} + \u03a3_{i=0}^{\u221e} gi e_{t\u2212i}   (5)\n\nIn (5), fi \u2208 R^{1\u00d7m} and gi \u2208 R are matrix and scalar coefficients of the unknown system impulse responses, while et is the Gaussian innovation sequence.\n\nFollowing the Prediction Error Minimization framework, identification of the dynamical system (5) is converted into estimation of the associated one-step-ahead predictor. Letting hk := {hk_t}t\u2208N denote the predictor impulse response associated with the k-th input {uk_t}t\u2208Z, one has\n\nyt = \u03a3_{k=1}^{m} [\u03a3_{i=1}^{\u221e} hk_i uk_{t\u2212i}] + \u03a3_{i=1}^{\u221e} h^{m+1}_i y_{t\u2212i} + et   (6)\n\nwhere h^{m+1} := {h^{m+1}_t}t\u2208N is the impulse response modeling the autoregressive component of the predictor. As is well known, if the joint spectrum of {yt} and {ut} is bounded away from zero, each hk is (BIBO) stable. Under this assumption, our aim is to estimate the predictor impulse responses, in a scenario where the number of measurements N is not large, as compared with m, and many measured inputs could be irrelevant for the prediction of yt. We will focus on the identification of ARMAX models, so that the zeta-transforms of {hk} are rational functions all sharing the same denominator, even if the approach described below immediately extends to general linear systems.\n\n4 A Bayesian model for identification of sparse linear systems\n\n4.1 Prior for predictor impulse responses\n\nWe model {hk} as independent Gaussian processes whose kernels share the same hyperparameters apart from the scale factors. 
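Since the prior on each hk is built from the sampled version of the stable spline kernel (4), its Gram matrix on a time grid is easy to form and to sample from. The following sketch (the grid length, \u03b2 = 0.4 and the Cholesky jitter are illustrative choices) shows that the induced variance K2(t, t) = e^{\u22123\u03b2t}/3 is finite at t = 0 and decays exponentially, as required for stable impulse responses.

```python
import numpy as np

def stable_spline_K2(t, tau, beta):
    """Stable spline kernel (4): the cubic spline kernel under the
    exponential change of coordinates s -> exp(-beta*s)."""
    m = np.maximum(t, tau)
    return (np.exp(-beta * (t + tau)) * np.exp(-beta * m) / 2.0
            - np.exp(-3.0 * beta * m) / 6.0)

beta = 0.4
t = np.arange(40, dtype=float)
K = stable_spline_K2(t[:, None], t[None, :], beta)  # Gram matrix on the grid

# Draw one realization of the (discrete-time) prior: its amplitude
# shrinks with t because the diagonal of K decays as exp(-3*beta*t)/3.
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(t)))
sample = L @ np.random.default_rng(0).standard_normal(len(t))
```

Repeating the draw yields realizations analogous to those in Fig. 1 (middle).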
In particular, each hk is proportional to the convolution of a zero-mean Gaussian process, with autocovariance given by the sampled version of K2, with a parametric impulse response r, used to capture dynamics hardly represented by a smooth process, e.g. high-frequency oscillations. For instance, the zeta-transform R(z) of r can be parametrized as follows:\n\nR(z) = z^2 / P\u03b8(z),   P\u03b8(z) = z^2 + \u03b81 z + \u03b82,   \u03b8 \u2208 \u0398 \u2282 R^2   (7)\n\nwhere the feasible region \u0398 constrains the two roots of P\u03b8(z) to belong to the open left unit semicircle in the complex plane. To better appreciate the role of the finite-dimensional component of the model, Fig. 1 (right panel) shows some realizations (with samples linearly interpolated) drawn from a discrete-time zero-mean normal process with autocovariance given by K2 enriched by \u03b8 = [1 0.61] in (7). Notice that, in this way, an oscillatory behavior is introduced in the realizations by enriching the Stable Spline kernel with the poles \u22120.5 \u00b1 0.6\u221a\u22121.\n\nThe kernel of hk defined by K2 and (7) is denoted by K : N \u00d7 N \u2192 R and depends on \u03b2, \u03b8. Thus, letting E[\u00b7] denote the expectation operator, the prior model on the impulse responses is given by\n\nE[hk_j hk_i] = \u03bb^2_k K(j, i; \u03b8, \u03b2),   k = 1, . . . , m + 1,   i, j \u2208 N\n\n4.2 Hyperprior for the hyperparameters\n\nThe noise variance \u03c3^2 will always be estimated via a preliminary step using a low-bias ARX model, as described in [24]. Thus, this parameter will be assumed known in the description of our Bayesian model. The hyperparameters \u03b2, \u03b8 and {\u03bbk} are instead modeled as mutually independent random vectors. \u03b2 is given a noninformative probability density on R+, while \u03b8 has a uniform distribution on \u0398. 
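Looking ahead to Section 5, the exponential hyperpriors on the {\u03bbk} defined next contribute the \u21131 term \u03b3 \u03a3k \u03bbk to the negative log marginal posterior J of (13), whose evaluation only requires forming the output covariance V[y+]. A minimal numerical sketch of that objective follows; the toy sizes and matrices are illustrative, with K standing for a sampled kernel Gram matrix.

```python
import numpy as np

def neg_log_marginal_posterior(y, A_list, K, lambdas, gamma, sigma2):
    """Objective (13): 0.5*log det(2*pi*V) + 0.5*y' V^{-1} y
    + gamma * sum_k lambda_k - log(gamma),
    where V = sigma2*I + sum_k lambda_k * A_k K A_k'."""
    V = sigma2 * np.eye(len(y))
    for lam, A in zip(lambdas, A_list):
        V += lam * (A @ K @ A.T)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * V)
    quad = y @ np.linalg.solve(V, y)
    return 0.5 * logdet + 0.5 * quad + gamma * sum(lambdas) - np.log(gamma)

# Toy check in one dimension, where V reduces to the scalar sigma2 + lambda.
J = neg_log_marginal_posterior(
    np.array([0.0]), [np.eye(1)], np.eye(1), [1.0], gamma=1.0, sigma2=1.0)
```

In the full method this objective is minimized over \u03b6 subject to the constraints in (12); here it is only evaluated once as a sanity check.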
Each \u03bbk is an exponential random variable with mean (and standard deviation) 1/\u03b3, \u03b3 \u2208 R+, i.e.\n\np(\u03bbk) = \u03b3 exp(\u2212\u03b3\u03bbk) \u03c7(\u03bbk \u2265 0),   k = 1, . . . , m + 1\n\nwith \u03c7 the indicator function. We also interpret \u03b3 as a random variable with a noninformative prior on R+. Finally, \u03b6 indicates the hyperparameter random vector, i.e. \u03b6 := [\u03bb1, . . . , \u03bbm+1, \u03b81, \u03b82, \u03b2, \u03b3].\n\n4.3 The full Bayesian model\n\nLet Ak \u2208 R^{N\u00d7\u221e} where, for j = 1, . . . , N and i \u2208 N, we have:\n\n[Ak]_{ji} = uk_{j\u2212i} for k = 1, . . . , m,   [Am+1]_{ji} = y_{j\u2212i}   (8)\n\nIn view of (6), using notation of ordinary algebra to handle infinite-dimensional objects, with each hk interpreted as an infinite-dimensional column vector, it holds that\n\ny+ = \u03a3_{k=1}^{m} Ak(uk) hk + Am+1(y+, y-) h^{m+1} + e   (9)\n\nwhere\n\ny+ = [y1, y2, . . . , yN]^T,   y- = [y0, y\u22121, y\u22122, . . .]^T,   e = [e1, e2, . . . , eN]^T   (10)\n\nIn practice, y- is never completely known and a solution is to set its unknown components to zero, see e.g. Section 3.2 in [1]. Further, the following approximation is exploited:\n\np(y+, {hk}, y- | \u03b6) \u2248 p(y+ | {hk}, y-, \u03b6) p({hk} | \u03b6) p(y-)   (11)\n\ni.e. the past y- is assumed not to carry information on the predictor impulse responses and the hyperparameters. Our stochastic model is described by the Bayesian network in Fig. 2 (left side). The dependence on y- is hereafter omitted, as well as the dependence of the {Ak} on y+ or uk. We start by reporting a preliminary lemma, whose proof can be found in [17], which will be needed in Propositions 2 and 3.\n\nLemma 1 Let the roots of P\u03b8 in (7) be stable. Then, if {yt} and {ut} are zero mean, finite variance stationary stochastic processes, each operator {Ak} is almost surely (a.s.) 
continuous in HK.\n\n5 Estimation of the hyperparameters and the predictor impulse responses\n\n5.1 Estimation of the hyperparameters\n\nWe estimate the hyperparameter vector \u03b6 by optimizing its marginal posterior, i.e. the joint density of y+, \u03b6 and {hk} where all the {hk} are integrated out. This is described in the next proposition, which derives from simple manipulations of probability densities whose well-posedness is guaranteed by Lemma 1. Below, IN is the N \u00d7 N identity matrix while, with a slight abuse of notation, K is now seen as an element of R^{\u221e\u00d7\u221e}, i.e. its i-th column is the sequence K(\u00b7, i), i \u2208 N.\n\nProposition 2 Let {yt} and {ut} be zero mean, finite variance stationary stochastic processes. Then, under the approximation (11), the maximum a posteriori estimate of \u03b6 given y+ is\n\n\u02c6\u03b6 = arg min_\u03b6 J(y+; \u03b6)   s.t.   \u03b8 \u2208 \u0398,   \u03b3, \u03b2 > 0,   \u03bbk \u2265 0 (k = 1, . . . , m + 1)   (12)\n\nwhere J is almost surely well defined pointwise and given by\n\nJ(y+; \u03b6) = (1/2) log(det[2\u03c0 V[y+]]) + (1/2) (y+)^T (V[y+])^{\u22121} y+ + \u03b3 \u03a3_{k=1}^{m+1} \u03bbk \u2212 log(\u03b3)   (13)\n\nwith V[y+] = \u03c3^2 IN + \u03a3_{k=1}^{m+1} \u03bbk Ak K A^T_k.\n\nThe objective (13), including the \u21131 penalty on {\u03bbk}, is a Bayesian modified version of that connected with multiple kernel learning, see Section 3 in [25]. Additional terms are log(det[V[y+]]) and log(\u03b3), which permit estimating the weight of the \u21131 norm jointly with the other hyperparameters. An important issue for the practical use of our numerical scheme is the availability of a good starting point for the optimizer. Below, we describe a scheme that achieves a suboptimal solution by solving an optimization problem in R^4 related to the reduced Bayesian model of Fig. 
2 (right side).\n\ni) Obtain {\u02c6\u03bbk}, \u02c6\u03b8 and \u02c6\u03b2 by solving the following modified version of problem (12):\n\narg min_\u03b6 [ J(y+; \u03b6) \u2212 \u03b3 \u03a3_{k=1}^{m+1} \u03bbk + log(\u03b3) ]   s.t.   \u03b8 \u2208 \u0398,   \u03b2 > 0,   \u03bb1 = . . . = \u03bbm+1 \u2265 0\n\nii) Set \u02c6\u03b3 = 1/\u02c6\u03bb1 and \u02c6\u03b6 = [\u02c6\u03bb1, . . . , \u02c6\u03bbm+1, \u02c6\u03b8, \u02c6\u03b2, \u02c6\u03b3]. Then, for k = 1, . . . , m + 1: set \u00af\u03b6 = \u02c6\u03b6 except for the k-th component of \u00af\u03b6, which is set to 0; if J(y+; \u00af\u03b6) \u2264 J(y+; \u02c6\u03b6), set \u02c6\u03b6 = \u00af\u03b6.\n\n5.2 Estimation of the predictor impulse responses for known \u03b6\n\nLet HK be the RKHS associated with K, with norm \u2016 \u00b7 \u2016HK. Let also \u02c6hk = E[hk | y+, \u03b6]. The following result comes from the representer theorem, whose applicability is guaranteed by Lemma 1.\n\nProposition 3 Under the same assumptions of Proposition 2, almost surely we have\n\n{\u02c6hk}_{k=1}^{m+1} = arg min_{{fk \u2208 HK}_{k=1}^{m+1}} \u2016y+ \u2212 \u03a3_{k=1}^{m+1} Ak fk\u2016^2 + \u03c3^2 \u03a3_{k=1}^{m+1} \u2016fk\u2016^2_{HK} / \u03bb^2_k\n\nwhere \u2016 \u00b7 \u2016 is the Euclidean norm. Moreover, almost surely, for k = 1, . . . , m + 1,\n\n\u02c6hk = \u03bb^2_k K A^T_k c,   c = (\u03c3^2 IN + \u03a3_{k=1}^{m+1} \u03bbk Ak K A^T_k)^{\u22121} y+   (14)\n\nAfter obtaining the estimates of the {hk}, simple formulas can then be used to derive the system impulse responses f and g in (5) and hence also the k-step ahead predictors, see [1] for details.\n\n6 Numerical experiments\n\nWe consider two Monte Carlo studies of 200 runs where at each run an ARMAX linear system with 15 inputs is generated as follows:\n\n\u2022 The number of hk different from zero is randomly drawn from the set {0, 1, 2, . . . , 8}.\n\u2022 Then, the order of the ARMAX model is randomly chosen in [1, 30] and the model is generated by the MATLAB function drmodel.m. The system and the predictor poles are restricted to have modulus less than 0.95, with the \u21132 norm of each hk bounded by 10.\n\nIn the first Monte Carlo experiment, at each run an identification data set of size 500 and a test set of size 1000 are generated using independent realizations of white noise as input. In the second experiment, the prediction on new data is more challenging. In fact, at each run, an identification data set of size 500 and a test set of size 1000 are generated via the MATLAB function idinput.m using, respectively, independent realizations of a random Gaussian signal with band [0, 0.8] and [0, 0.9] (the interval boundaries specify the lower and upper limits of the passband, expressed as fractions of the Nyquist frequency). We compare the following estimators:\n\nFigure 3: Boxplots of the values of COD1 obtained by PEM+Or, Stable Spline, GLAR and PEM+BIC in the two experiments. The outliers obtained by PEM+BIC are not all displayed.\n\nExperiment   PEM+Oracle   Stable Spline   Subopt. Stable Spline   GLAR\n#1           100%         98.7%           97.5%                   45.6%\n#2           100%         98.4%           98.2%                   52.4%\n\nTable 1: Percentage of the hk equal to zero correctly set to zero by the employed estimator.\n\n1. GLAR: this is the GLAR algorithm described in [11] applied to ARX models; the order (between 1 and 30) and the level of sparsity (i.e. the number of null hk) are determined using the first 2/3 of the 500 available data as training set and the remaining part as validation data (the use of Cp statistics does not provide better results in this case).\n\n2. PEM+Oracle: this is the classical PEM approach, as implemented in the pem.m function of the MATLAB System Identification Toolbox [26], equipped with an oracle that, at every run, knows which predictor impulse responses are zero and, having access to the test set, selects those model orders that provide the best prediction performance.\n\n3. PEM+BIC: this is the classical PEM approach that uses BIC for model order selection. The orders of the polynomials in the ARMAX model are not allowed to differ from each other, since this would lead to a combinatorial explosion of the number of competitive models.\n\n4. Stable Spline: this is the approach based on the full Bayesian model of Fig. 2. The first 40 available input/output pairs enter the {Ak} in (9), so that N = 460. For computational reasons, the number of estimated predictor coefficients is 40.\n\n5. Suboptimal Stable Spline: the same as above, except that we exploit the reduced Bayesian model of Fig. 2 complemented with the procedure described at the end of Subsection 5.1.\n\nThe following performance indices are considered:\n\n1. Percentage of the impulse responses equal to zero correctly set to zero by the estimator.\n\n2. k-step-ahead Coefficient of Determination, denoted by CODk, quantifying how much of the test set variance is explained by the forecast. 
It is computed at each run as\n\nCODk := 1 \u2212 RMS^2_k / ((1/1000) \u03a3_{t=1}^{1000} (y^{test}_t \u2212 \u0233^{test})^2),   RMSk := ((1/1000) \u03a3_{t=1}^{1000} (y^{test}_t \u2212 \u0177^{test}_{t|t\u2212k})^2)^{1/2}   (15)\n\nwhere \u0233^{test} is the sample mean of the test set data {y^{test}_t}_{t=1}^{1000} and \u0177^{test}_{t|t\u2212k} is the k-step ahead prediction computed using the estimated model. The average index obtained during the Monte Carlo study, as a function of k, is then denoted by CODk.\n\nFigure 4: CODk, i.e. average coefficient of determination relative to k-step ahead prediction, obtained during the Monte Carlo study #1 (top) and #2 (bottom) using PEM+Oracle (\u2022), GLAR (\u2217), Stable Spline based on the full (\u25e6) and the reduced (+) Bayesian model of Fig. 2.\n\nNotice that, in both cases, the larger the index, the better the performance of the estimator. In every experiment the performance of PEM+BIC has been largely unsatisfactory, providing strongly negative values for CODk. This is illustrated e.g. in Fig. 3, which shows the boxplots of the 200 values of COD1 obtained by 4 of the employed estimators during the two Monte Carlo studies. We have also assessed that results do not improve using AIC. In view of this, in what follows other results from PEM+BIC will not be shown.\n\nTable 1 reports the percentage of the predictor impulse responses equal to zero correctly estimated as zero by the estimators. Remarkably, in all cases the Stable Spline estimators not only outperform GLAR but also achieve a percentage close to 99%. This shows that the use of the marginal posterior permits effective detection of the subset of the {\u03bbk} equal to zero. Finally, Fig. 
4 displays CODk as a function of the prediction horizon k obtained during the Monte Carlo study #1 (top) and #2 (bottom). The performance of Stable Spline appears superior to that of GLAR and is comparable with that of PEM+Oracle, also when the reduced Bayesian model of Fig. 2 is used.\n\n7 Conclusions\n\nWe have shown how identification of large sparse dynamic systems can benefit from the flexibility of kernel methods. To this aim, we have extended a recently proposed nonparametric paradigm to identify sparse models via prediction error minimization. Predictor impulse responses are modeled as zero-mean Gaussian processes using stable spline kernels encoding the BIBO-stability constraint, and sparsity is induced by exponential hyperpriors on their scale factors. The method compares very favorably with GLAR, and its performance is close to that achievable by combining PEM with an oracle which exploits the test set in order to select the best model order. In the near future we plan to provide a theoretical analysis characterizing the hyperprior-based scheme as well as to design new ad hoc optimization schemes for hyperparameter estimation.\n\nReferences\n[1] L. Ljung. System Identification - Theory For the User. Prentice Hall, 1999.\n[2] J. Mohammadpour and K.M. Grigoriadis. Efficient Modeling and Control of Large-scale Systems. Springer, 2010.\n[3] T. J. Hastie and R. J. Tibshirani. Generalized additive models. In Monographs on Statistics and Applied Probability, volume 43. Chapman and Hall, London, UK, 1990.\n[4] D. Donoho. Compressed sensing. IEEE Trans. on Information Theory, 52(4):1289\u20131306, 2006.\n[5] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716\u2013723, 1974.\n[6] G. 
Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461\u2013464, 1978.\n[7] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society,\n\nSeries B., 58, 1996.\n\n[8] B. Efron, T. Hastie, L. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407\u2013\n\n499, 2004.\n\n[9] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research,\n\n7:2541\u20132563, 2006.\n\n[10] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association,\n\n101:1418\u20131429, 2006.\n\n[11] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of\n\nthe Royal Statistical Society, Series B, 68:49\u201367, 2006.\n\n[12] F.R. Bach. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9:1179\u2013\n\n1225, 2008.\n\n[13] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine\n\nLearning Research, 6:1099\u20131125, 2005.\n\n[14] H. Wang, G. Li, and C.L. Tsai. Regression coef\ufb01cient and autoregressive order shrinkage and selection\n\nvia the lasso. Journal Of The Royal Statistical Society Series B, 69(1):63\u201378, 2007.\n\n[15] Nan-Jung Hsu, Hung-Lin Hung, and Ya-Mei Chang. Subset selection for vector autoregressive processes\n\nusing lasso. Computational Statistics and Data Analysis, 52:3645\u20133657, 2008.\n\n[16] G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identi\ufb01cation. Automat-\n\nica, 46(1):81\u201393, 2010.\n\n[17] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identi\ufb01cation of linear systems: a nonpara-\n\nmetric Gaussian regression approach. Automatica (in press), 2011.\n\n[18] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.\n[19] N. Aronszajn. Theory of reproducing kernels. 
Transactions of the American Mathematical Society,\n\n68:337\u2013404, 1950.\n\n[20] G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and randomized GACV. Technical\n\nReport 984, Department of Statistics, University of Wisconsin, 1998.\n\n[21] G. Kimeldorf and G. Wahba. Some results on Tchebychef\ufb01an spline functions. Journal of Mathematical\n\nAnalysis and Applications, 33(1):82\u201395, 1971.\n\n[22] A. J. Smola and B. Sch\u00a8olkopf. Bayesian kernel methods.\n\nIn S. Mendelson and A. J. Smola, editors,\nMachine Learning, Proceedings of the Summer School, Australian National University, pages 65\u2013117,\nBerlin, Germany, 2003. Springer-Verlag.\n\n[23] G. Wahba. Spline models for observational data. SIAM, Philadelphia, 1990.\n[24] G.C. Goodwin, M. Gevers, and B. Ninness. Quantifying the error in estimated transfer functions with\n\napplication to model order selection. IEEE Transactions on Automatic Control, 37(7):913\u2013928, 1992.\n\n[25] F. Dinuzzo. Kernel machines with two layers and multiple kernel learning. Technical report, Preprint\n\narXiv:1001.2709, 2010. Available at http://www-dimat.unipv.it/ dinuzzo.\n\n[26] L. Ljung. System Identi\ufb01cation Toolbox V7.1 for Matlab. Natick, MA: The MathWorks, Inc., 2007.\n\n9\n\n\f", "award": [], "sourceid": 575, "authors": [{"given_name": "Alessandro", "family_name": "Chiuso", "institution": null}, {"given_name": "Gianluigi", "family_name": "Pillonetto", "institution": null}]}