{"title": "Kernel Observers: Systems-Theoretic Modeling and Inference of Spatiotemporally Evolving Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 3990, "page_last": 3998, "abstract": "We consider the problem of estimating the latent state of a spatiotemporally evolving continuous function using very few sensor measurements. We show that layering a dynamical systems prior over temporal evolution of weights of a kernel model is a valid approach to spatiotemporal modeling that does not necessarily require the design of complex nonstationary kernels. Furthermore, we show that such a predictive model can be utilized to determine sensing locations that guarantee that the hidden state of the phenomena can be recovered with very few measurements. We provide sufficient conditions on the number and spatial location of samples required to guarantee state recovery, and provide a lower bound on the minimum number of samples required to robustly infer the hidden states. Our approach outperforms existing methods in numerical experiments.", "full_text": "Kernel Observers: Systems-Theoretic Modeling and\nInference of Spatiotemporally Evolving Processes\n\nHassan A. Kingravi\n\nPindrop\n\nAtlanta, GA 30308\n\nhkingravi@pindrop.com\n\nHarshal Maske and Girish Chowdhary\nUniversity of Illinois at Urbana Champaign\n\nUrbana, IL 61801\n\nhmaske2@illinois.edu, girishc@illinois.edu\n\nAbstract\n\nWe consider the problem of estimating the latent state of a spatiotemporally evolv-\ning continuous function using very few sensor measurements. We show that\nlayering a dynamical systems prior over temporal evolution of weights of a kernel\nmodel is a valid approach to spatiotemporal modeling, and that it does not require\nthe design of complex nonstationary kernels. 
Furthermore, we show that such a differentially constrained predictive model can be utilized to determine sensing locations that guarantee that the hidden state of the phenomenon can be recovered with very few measurements. We provide sufficient conditions on the number and spatial location of samples required to guarantee state recovery, and provide a lower bound on the minimum number of samples required to robustly infer the hidden states. Our approach outperforms existing methods in numerical experiments.

1 Introduction

Modeling of large-scale stochastic phenomena with both spatial and temporal (spatiotemporal) evolution is a fundamental problem in the applied sciences and social networks. The spatial and temporal evolution in such domains is constrained by stochastic partial differential equations, whose structure and parameters may be time-varying and unknown. While modeling spatiotemporal phenomena has traditionally been the province of the field of geostatistics, it has in recent years gained more attention in the machine learning community [2]. The data-driven models developed through machine learning techniques provide a way to capture complex spatiotemporal phenomena that are not easily modeled by first principles alone, such as stochastic partial differential equations. In the machine learning community, kernel methods represent a class of extremely well-studied and powerful methods for inference in spatial domains; in these techniques, correlations between the input variables are encoded through a covariance kernel, and the model is formed through a linear weighted combination of the kernels [14]. In recent years, kernel methods have been applied to spatiotemporal modeling with varying degrees of success [2, 14]. Many recent techniques in spatiotemporal modeling have focused on nonstationary covariance kernel design and associated hyperparameter learning algorithms [4, 7, 12].
The main benefit of careful covariance kernel design over approaches that simply include time as an additional input variable is that it can account for intricate spatiotemporal couplings. However, there are two key challenges with these approaches. The first is ensuring the scalability of the model to large-scale phenomena: the hyperparameter optimization problem is not convex in general, leading to methods that are difficult to implement, susceptible to local minima, and computationally intractable for large datasets. In addition to the challenge of modeling spatiotemporally varying processes, we are interested in addressing a second, very important and widely unaddressed challenge: given a predictive model of the spatiotemporal phenomenon, how can its current latent state be estimated using as few sensor measurements as possible? This is called the monitoring problem.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Monitoring a spatiotemporal phenomenon is concerned with estimating its current state, predicting its future evolution, and inferring its initial conditions utilizing limited sensor measurements. The key challenges here manifest due to the fact that it is typically infeasible or expensive to deploy sensors at a large scale across vast spatial domains. To minimize the number of sensors deployed, a predictive data-driven model of the spatiotemporal evolution could be learned from historic datasets or through remote sensing (e.g. satellite, radar) datasets.
Then, to monitor the phenomenon, the key problem boils down to reliably and quickly estimating the evolving latent state of the phenomenon utilizing measurements from very few sampling locations.

In this paper, we present an alternative perspective on solving the spatiotemporal monitoring problem that brings together kernel-based modeling, systems theory, and Bayesian filtering. Our main contributions are two-fold. First, we demonstrate that spatiotemporal functional evolution can be modeled using stationary kernels with a linear dynamical systems layer on their mixing weights. In other words, the model proposed here posits differential constraints, embodied as a linear dynamical system, on the spatiotemporal evolution of a kernel-based model, such as a Gaussian process. This approach does not necessarily require the design of complex spatiotemporal kernels, and can accommodate positive-definite kernels on any domain on which it is possible to define them, including non-Euclidean domains such as Riemannian manifolds, strings, graphs, and images [6]. Second, we show that the model can be utilized to determine sensing locations that guarantee that the hidden states of functional evolution can be estimated using a Bayesian state estimator with very few measurements. We provide sufficient conditions on the number and location of sensor measurements required, and prove non-conservative lower bounds on the minimum number of sampling locations. The validity of the presented model and sensing techniques is corroborated using synthetic and large real datasets.

1.1 Related Work

There is a large body of literature on spatiotemporal modeling in geostatistics, where specific process-dependent kernels can be used [17, 2]. From the machine learning perspective, a naive approach is to utilize both spatial and temporal variables as inputs to a Mercer kernel [10].
However, this technique leads to an ever-growing kernel dictionary. Furthermore, constraining the dictionary size or utilizing a moving window will occlude learning of long-term patterns. Periodic or nonstationary covariance functions and nonlinear transformations have been proposed to address this issue [7, 14]. Work focusing on nonseparable and nonstationary covariance kernels seeks to design kernels optimized for environment-specific dynamics, and to tune their hyperparameters in local regions of the input space. Seminal work in [5] proposes a process convolution approach for space-time modeling, which captures nonstationary structure by allowing the convolution kernel to vary across the input space. This approach can be extended to a class of nonstationary covariance functions, thereby allowing the use of a Gaussian process (GP) framework, as shown in [9]. However, since this model's hyperparameters are inferred using MCMC integration, its application has been limited to smaller datasets. To overcome this limitation, [12] proposes to use the mean estimates of a second isotropic GP (defined over latent length scales) to parameterize the nonstationary covariances. Finally, [4] considers nonisotropic variation across the different dimensions of the input space for the second GP, as opposed to the isotropic variation of [12]. Issues with this line of approach include the nonconvexity of the hyperparameter optimization problem and the fact that selecting an appropriate nonstationary covariance function for the task at hand is a nontrivial design decision (as noted in [16]). Apart from directly modeling the covariance function using additional latent GPs, there exist several other approaches for specifying nonstationary GP models.
One approach maps the nonstationary spatial process into a latent space in which the problem becomes approximately stationary [15]. Along similar lines, [11] extends the input space with latent variables, which allows the model to capture nonstationarity in the original space. Both of these approaches require MCMC sampling for inference, and as such are subject to the limitations mentioned in the preceding paragraph.

A geostatistics approach that finds dynamical transition models on the linear combination weights of a parameterized model [2, 8] is advantageous when the spatial and temporal dynamics are hierarchically separated, leading to a convex learning problem. As a result, complex nonstationary kernels are often not necessary (although they can be accommodated). The approach presented in this paper aligns closely with this vein of work. A systems-theoretic study of this viewpoint enables the fundamental contributions of the paper, which are 1) allowing for inference on more general domains with a larger class of basis functions than those typically considered in the geostatistics community, and 2) quantifying the minimum number of measurements required to estimate the state of functional evolution.

[Figure 1: Two types of Hilbert space evolutions. Left: discrete switches in RKHS H; Right: smooth evolution in H.]

[Figure 2: Shaded observation matrices for a dictionary of atoms. (a) 1-shaded (Def. 1); (b) 2-shaded (Eq. (4)).]

It should be noted that the contribution of the paper concerning sensor placement is to provide sufficient conditions for monitoring rather than to optimize the placement locations; hence a comparison with placement-optimization approaches is not considered in the experiments.

2 Kernel Observers

This section outlines our modeling framework and presents theoretical results associated with the number of sampling locations required for monitoring functional evolution.

2.1 Problem Formulation

We focus on predictive inference of a time-varying stochastic process whose mean f evolves temporally as f_{τ+1} ∼ F(f_τ, η_τ), where F is a distribution varying with time τ and exogenous inputs η. Our approach builds on the fact that in several cases, temporal evolution can be hierarchically separated from spatial functional evolution. A classical and quite general example of this is the abstract evolution equation (AEO), which can be defined as the evolution of a function u embedded in a Banach space B: u̇(t) = Lu(t), subject to u(0) = u_0, where L : B → B determines the spatiotemporal transitions of u ∈ B [1]. This model of spatiotemporal evolution is very general (AEOs, for example, model many PDEs), but working in Banach spaces can be computationally taxing. A simple way to make the approach computationally realizable is to place restrictions on B: in particular, we restrict the sequence f_τ to lie in a reproducing kernel Hilbert space (RKHS), the theory of which provides powerful tools for generating flexible classes of functions with relative ease [14].
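As a quick numerical aside (our sketch, not part of the paper): the Gaussian RBF kernel used throughout admits a randomized finite-dimensional feature map whose inner products approximate the kernel, which is exactly the kind of approximate feature space exploited below. The feature count `D` and bandwidth `sigma` here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, sigma = 2, 8000, 1.0  # input dim, number of random features, kernel bandwidth

# random Fourier features for k(x, y) = exp(-||x - y||^2 / (2 sigma^2)):
# sample frequency rows from the kernel's Fourier transform, N(0, I / sigma^2)
V = rng.standard_normal((D // 2, n)) / sigma

def phi(x):
    """Approximate feature map: [sin(Vx); cos(Vx)] / sqrt(D/2)."""
    proj = V @ x
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(D // 2)

x, y = rng.standard_normal(n), rng.standard_normal(n)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
approx = phi(x) @ phi(y)  # Monte Carlo estimate of k(x, y)
```

The identity cos(ω·x)cos(ω·y) + sin(ω·x)sin(ω·y) = cos(ω·(x−y)) makes each feature pair an unbiased estimate of the kernel, so the dot product concentrates around the exact value as D grows.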
In a kernel-based model, k : Ω × Ω → R is a positive-definite Mercer kernel on a domain Ω that models the covariance between any two points in the input space, and implies the existence of a smooth map ψ : Ω → H, where H is an RKHS with the property k(x, y) = ⟨ψ(x), ψ(y)⟩_H. The key insight behind the proposed model is that spatiotemporal evolution in the input domain corresponds to temporal evolution of the mixing weights of a kernel model alone in the functional domain. Therefore, f_τ can be modeled by tracing the evolution of its mean embedded in an RKHS using switched ordinary differential equations (ODEs) when the evolution is continuous, or switched difference equations when it is discrete (Figure 1). The advantage of this approach is that it allows us to utilize powerful ideas from systems theory for deriving necessary and sufficient conditions for spatiotemporal monitoring. In this paper, we restrict our attention to the class of functional evolutions F defined by linear Markovian transitions in an RKHS. While extension to the nonlinear case is possible (and non-trivial), it is not pursued in this paper, to ease the exposition of the key ideas. The class of linear transitions in an RKHS is rich enough to model many real-world datasets, as suggested by our experiments.

Let y_τ ∈ R^N be the measurements of the function available from N sensors at time τ, A : H → H be a linear transition operator in the RKHS H, and K : H → R^N be a linear measurement operator. The model for the functional evolution and measurement studied in this paper is

  f_{τ+1} = A f_τ + η_τ,    y_τ = K f_τ + ζ_τ,    (1)

where η_τ is a zero-mean stochastic process in H, and ζ_τ is a Wiener process in R^N.

Classical treatments of kernel methods emphasize that for most kernels, the feature map ψ is unknown, and possibly infinite-dimensional; this forces practitioners to work in the dual space of H, whose dimensionality is the number of samples in the dataset being modeled. This conventional wisdom precludes the use of kernel methods for most tasks involving modern datasets, which may have millions, and sometimes billions, of samples [13]. An alternative is to work with a feature map ψ̂(x) := [ψ̂_1(x) ··· ψ̂_M(x)] to an approximate feature space Ĥ, with the property that for every element f ∈ H, there exist f̂ ∈ Ĥ and ε > 0 s.t. ‖f − f̂‖ < ε for an appropriate function norm. A few such approximations are listed below.

Dictionary of atoms. Let Ω be compact. Given points C = {c_1, ..., c_M}, c_i ∈ Ω, we have a dictionary of atoms F_C = {ψ(c_1), ..., ψ(c_M)}, ψ(c_i) ∈ H, the span of which is a strict subspace Ĥ of the RKHS H generated by the kernel. Here,

  ψ̂_i(x) := ⟨ψ(x), ψ(c_i)⟩_H = k(x, c_i).    (2)

Low-rank approximations. Let Ω be compact, let C = {c_1, ..., c_M}, c_i ∈ Ω, and let K ∈ R^{M×M}, K_ij := k(c_i, c_j), be the Gram matrix computed from C. This matrix can be diagonalized to compute approximations (λ̂_i, φ̂_i(x)) of the eigenvalues and eigenfunctions (λ_i, φ_i(x)) of the kernel [18]. These spectral quantities can then be used to compute ψ̂_i(x) := (λ̂_i)^{1/2} φ̂_i(x).

Random Fourier features. Let Ω ⊂ R^n be compact, and let k(x, y) = e^{−‖x−y‖²/2σ²} be the Gaussian RBF kernel. Then random Fourier features approximate the kernel feature map as ψ̂_ω : Ω → Ĥ, where ω is a sample from the Fourier transform of k(x, y), with the property that k(x, y) = E_ω[⟨ψ̂_ω(x), ψ̂_ω(y)⟩_Ĥ] [13]. In this case, if V ∈ R^{M/2×n} is a random matrix representing the sample ω, then ψ̂_i(x) := [(1/√M) sin([V x]_i), (1/√M) cos([V x]_i)]. Similar approximations exist for other radially symmetric and dot-product kernels.

In the approximate-space case, we replace the transition operator A : H → H in (1) by Â : Ĥ → Ĥ. This approximate regime, which trades off the flexibility of a truly nonparametric approach for computational realizability, still allows for the representation of rich phenomena, as will be seen in the sequel. The finite-dimensional evolution equations approximating (1) in dual form are

  w_{τ+1} = Â w_τ + η_τ,    y_τ = K w_τ + ζ_τ,    (3)

where Â ∈ R^{M×M}, K ∈ R^{N×M}, w_τ ∈ R^M, and where we have slightly abused notation to let η_τ and ζ_τ denote their Ĥ counterparts. Here K is the matrix whose rows are of the form K(i) = Ψ̂(x_i) = [ψ̂_1(x_i) ψ̂_2(x_i) ··· ψ̂_M(x_i)]. In systems-theoretic language, each row of K corresponds to a measurement at a particular location, and the matrix itself acts as a measurement operator. We define the generalized observability matrix [20] as

  O_Υ = [K Â^{τ_1}; ··· ; K Â^{τ_L}],

where Υ = {τ_1, ..., τ_L} is the set of instances τ_i at which we apply the operator K. A linear system is said to be observable if O_Υ has full column rank (i.e. Rank O_Υ = M) for Υ = {0, 1, ..., M − 1} [20]. Observability guarantees two critical facts: first, it guarantees that the state w_0 can be recovered exactly from a finite series of measurements {y_{τ_1}, y_{τ_2}, ..., y_{τ_L}}; in particular, defining y_Υ = [y_{τ_1}^T, y_{τ_2}^T, ..., y_{τ_L}^T]^T, we have that y_Υ = O_Υ w_0. Second, it guarantees that a feedback-based observer can be designed such that the estimate of w_τ, denoted by ŵ_τ, converges exponentially fast to w_τ in the limit of samples. Note that all our theoretical results assume Â is available: while we perform system identification in the experiments (Section 3.3), it is not the focus of the paper.

We are now in a position to formally state the spatiotemporal modeling and inference problem considered: given a spatiotemporally evolving system modeled using (3), choose a set of N sensing locations such that, even with N ≪ M, the functional evolution of the spatiotemporal model can be estimated (which corresponds to monitoring) and predicted robustly (which corresponds to Bayesian filtering). Our approach to solving this problem relies on the design of the measurement operator K so that the pair (K, Â) is observable: any Bayesian state estimator (e.g. a Kalman filter) utilizing this pair is denoted a kernel observer.¹ We will leverage the spectral decomposition of Â for this task (see §?? in the supplementary for details on spectral decomposition).

2.2 Main Results

In this section, we prove results concerning the observability of spatiotemporally varying functions modeled by the functional evolution and measurement equations (3) formulated in Section 2.1.
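Before the formal results, the observability machinery above can be made concrete with a small numerical sketch (our illustration, with toy sizes and a Gaussian dictionary; not from the paper): build K from dictionary features at N = 2 random sensing locations, simulate the noiseless weight-space model (3), form O_Υ for Υ = {0, ..., M−1}, and recover w_0 from the stacked measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, sigma = 6, 2, 0.5
centers = np.linspace(0.0, 1.0, M)   # dictionary centers c_i
sensors = rng.uniform(0.0, 1.0, N)   # sensing locations x_i

# K: row i is the dictionary feature vector of sensing location x_i,
# i.e. the kernel matrix between the sensor set X and the centers C
K = np.exp(-(sensors[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

# Gaussian rows are nowhere exactly zero, so every column is covered: K is shaded
assert (np.abs(K) > 0).any(axis=0).all()

# weight dynamics with distinct eigenvalues (full-rank Jordan form)
A_hat = np.diag(np.linspace(0.5, 0.95, M))

# simulate the noiseless model (3): w_{tau+1} = A w_tau, y_tau = K w_tau
w0 = rng.standard_normal(M)
w, ys = w0.copy(), []
for _ in range(M):
    ys.append(K @ w)
    w = A_hat @ w

# generalized observability matrix for Upsilon = {0, 1, ..., M-1}
O = np.vstack([K @ np.linalg.matrix_power(A_hat, t) for t in range(M)])
rank = np.linalg.matrix_rank(O)  # = M here: observable with only N = 2 sensors

# stacked measurements y_Upsilon = O w0, so least squares recovers the state
w0_recovered = np.linalg.lstsq(O, np.concatenate(ys), rcond=None)[0]
```

With distinct eigenvalues and a shaded K, the 12×6 matrix O has full column rank, so the hidden 6-dimensional state is pinned down by two sensors observed over six time steps.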
¹In the case where no measurements are taken, for the sake of consistency, we denote the state estimator as an autonomous kernel observer, despite this being something of an oxymoron.

In particular, observability of the system states implies that we can recover the current state of the spatiotemporally varying function using a small number of sampling locations N, which allows us to 1) track the function, and 2) predict its evolution forward in time. We work with the approximation Ĥ ≈ H: given M basis functions, this implies that the dual space of Ĥ is R^M. Proposition 1 shows that if Â has a full-rank Jordan decomposition, an observation matrix K meeting a condition called shadedness (Definition 1) is sufficient for the system to be observable. Proposition 2 provides a lower bound on the number of sampling locations required for observability which holds for any Â. Proposition 3 constructively shows the existence of an abstract measurement map K̃ achieving this lower bound. Finally, since the measurement map does not have the structure of a kernel matrix, a slightly weaker sufficient condition for the observability of any Â is given in Theorem 1. Proofs of all claims are in the supplementary material.

Definition 1 (Shaded Observation Matrix). Given k : Ω × Ω → R positive-definite on a domain Ω, let {ψ̂_1(x), ..., ψ̂_M(x)} be the set of bases generating an approximate feature map ψ̂ : Ω → Ĥ, and let X = {x_1, ..., x_N}, x_i ∈ Ω. Let K ∈ R^{N×M} be the observation matrix, where K_ij := ψ̂_j(x_i). For each row K(i) := [ψ̂_1(x_i) ··· ψ̂_M(x_i)], define the set I(i) := {ι_1^(i), ι_2^(i), ..., ι_{M_i}^(i)} to be the indices of the entries in row i which are nonzero. Then if ∪_{i∈{1,...,N}} I(i) = {1, 2, ..., M}, we denote K as a shaded observation matrix (see Figure 2a).

This definition seems quite abstract, so the following remark considers a more concrete example.

Remark 1. Let ψ̂ be generated by the dictionary given by C = {c_1, ..., c_M}, c_i ∈ Ω. Note that since ψ̂_j(x_i) = ⟨ψ(x_i), ψ(c_j)⟩_H = k(x_i, c_j), K is the kernel matrix between X and C. For the kernel matrix to be shaded thus implies that there does not exist an atom ψ(c_j) such that the projections ⟨ψ(x_i), ψ(c_j)⟩_H vanish for all x_i, 1 ≤ i ≤ N. Intuitively, the shadedness property requires that the sensor locations x_i are privy to information propagating from every c_j. As an example, note that, in principle, for the Gaussian kernel, a single row generates a shaded kernel matrix.²

²However, in this case, the matrix can have many entries that are extremely close to zero, and will probably be very ill-conditioned.

Proposition 1. Given k : Ω × Ω → R positive-definite on a domain Ω, let {ψ̂_1(x), ..., ψ̂_M(x)} be the set of bases generating an approximate feature map ψ̂ : Ω → Ĥ, and let X = {x_1, ..., x_N}, x_i ∈ Ω. Consider the discrete linear system on Ĥ given by the evolution and measurement equations (3). Suppose that a full-rank Jordan decomposition of Â ∈ R^{M×M} of the form Â = P Λ P^{−1} exists, where Λ = [Λ_1 ··· Λ_O], and that there are no repeated eigenvalues. Then, given a set of time instances Υ = {τ_1, τ_2, ..., τ_L} and a set of sampling locations X = {x_1, ..., x_N}, the system (3) is observable if the observation matrix K is shaded according to Definition 1, Υ has distinct values, and |Υ| ≥ M.

When the eigenvalues of the system matrix are repeated, it is not enough for K to be shaded. In the next proposition, we take a geometric approach and utilize the rational canonical form of Â to obtain a lower bound on the number of sampling locations required. Let r be the number of unique eigenvalues of Â, and let γ_{λ_i} denote the geometric multiplicity of eigenvalue λ_i. Then the cyclic index of Â is defined as ℓ = max_{1≤i≤r} γ_{λ_i} [19] (see supplementary section ?? for details).

Proposition 2. Suppose that the conditions in Proposition 1 hold, with the relaxation that the Jordan blocks [Λ_1 ··· Λ_O] may have repeated eigenvalues (i.e. ∃ Λ_i and Λ_j s.t. λ_i = λ_j). Then there exist kernels k(x, y) such that the lower bound ℓ on the number of sampling locations N is given by the cyclic index of Â.

Section ?? in the supplementary gives a concrete example to build intuition regarding this lower bound. We now show how to construct a matrix K̃ corresponding to the lower bound ℓ.

Proposition 3. Given the conditions stated in Proposition 2, it is possible to construct a measurement map K̃ ∈ R^{ℓ×M} for the system given by (3), such that the pair (K̃, Â) is observable.

The construction provided in the proof of Proposition 3 is utilized in Algorithm 1, which uses the rational canonical structure of Â to generate a series of vectors v_i ∈ R^M whose iterates {v_1, ..., Â^{m_1−1} v_1, ..., v_ℓ, ..., Â^{m_ℓ−1} v_ℓ} generate a basis for R^M.

Algorithm 1: Measurement Map K̃
  Input: Â ∈ R^{M×M}
  Compute the rational canonical form, such that C = Q^{−1} Â^T Q. Set C_0 := C and M_0 := M.
  for i = 1 to ℓ do
    Obtain the minimal polynomial α_i(λ) of C_{i−1}. This returns associated indices J(i) ⊂ {1, 2, ..., M_{i−1}}.
    Construct a vector v_i ∈ R^M such that ξ_{v_i}(λ) = α_i(λ).
    Use the indices {1, 2, ..., M_{i−1}} \ J(i) to select the matrix C_i. Set M_i := |{1, 2, ..., M_{i−1}} \ J(i)|.
  end for
  Compute K̊ = [v_1^T, v_2^T, ..., v_ℓ^T]^T.
  Output: K̃ = K̊ Q^{−1}

Unfortunately, the measurement map K̃, being an abstract construction unrelated to the kernel, does not directly select X. We show how to use the measurement map to guide a search for X in Remark ??. For now, we state a sufficient condition for observability of a general system.

Theorem 1. Suppose that the conditions in Proposition 1 hold, with the relaxation that the Jordan blocks [Λ_1 ··· Λ_O] may have repeated eigenvalues. Let ℓ be the cyclic index of Â. Define

  K = [K(1)^T ··· K(ℓ)^T]^T    (4)

as the ℓ-shaded matrix, which consists of ℓ shaded matrices with the property that any subset of ℓ columns in the matrix are linearly independent from each other. Then system (3) is observable if Υ has distinct values and |Υ| ≥ M.

While Theorem 1 is a quite general result, the condition that any ℓ columns of K be linearly independent is a very stringent one. One scenario where this condition can be met with minimal measurements is when the feature map ψ̂(x) is generated by a dictionary of atoms with the Gaussian RBF kernel evaluated at sampling locations {x_1, ..., x_N} according to (2), where x_i ∈ Ω ⊂ R^d and the x_i are sampled from a non-degenerate probability distribution on Ω, such as the uniform distribution. For a semi-deterministic approach, when the dynamics matrix Â is block-diagonal, a simple heuristic is given in Remark ?? in the supplementary. Note that in practice the matrix Â needs to be inferred from measurements of the process f_τ. If no assumptions are placed on Â, at least M sensors are required for the system identification phase. Future work will study the precise conditions under which system identification is possible with fewer than M sensors. Finally, computing the Jordan and rational canonical forms can be computationally expensive: see the supplementary for more details. We note that the crucial step in our approach is computing the cyclic index, which gives us the minimum number of sensors that need to be deployed, and whose computational complexity is O(M³). Computation of the canonical forms is required only in the case where we need to strictly realize the lower bound on the number of sensors.

3 Experimental Results

3.1 Sampling Locations for Synthetic Data Sets

The goal of this experiment is to investigate the dependency of the observability of system (3) on the shaded observation matrix and the lower bound presented in Proposition 2. The domain is fixed on the interval Ω = [0, 2π]. First, we pick sets of points C(ι) = {c_1, . . .
, c_{Mι}}, c_j ∈ Ω, with M = 50, and we construct a dynamics matrix A = Λ ∈ R^{M×M} with cyclic index 5. We pick the RBF kernel k(x, y) = e^{−‖x−y‖²/2σ²}, σ = 0.02. Generating samples X = {x_1, ..., x_N}, x_i ∈ Ω randomly, we compute the ℓ-shaded property and observability for this system. Figure 3a shows how shadedness is a necessary condition for observability, validating Proposition 1; the slight gap between shadedness and observability can be explained by numerical issues in computing the rank of O_Υ. Next, we again pick M = 50, but for a system with cyclic index ℓ = 18. We constructed the measurement map K̃ using Algorithm 1, and used the heuristic in Remark ?? (Algorithm 2 in the supplementary) as well as random sampling to generate the sampling locations X. These results are presented in Figure 3b; the plot for random sampling has been averaged over 100 runs. It is evident from the plot that observability cannot be achieved for a number of samples N < ℓ. Clearly, the heuristic presented outperforms random sampling; note, however, that our intent is not to compare the heuristic against random sampling, but to show that the lower bound ℓ provides decisive guidelines for selecting the number of samples while using the computationally efficient random approach.

3.2 Comparison With Nonstationary Kernel Methods on Real-World Data

We use two real-world datasets to evaluate and compare the kernel observer with the two different lines of approach for nonstationary kernels discussed in Section 1.1.
For the Process Convolution with Local Smoothing Kernel (PCLSK) and Latent Extension of Input Space (LEIS) approaches, we compare with NOSTILL-GP [4] and [11] respectively, on the Intel Berkeley and Irish Wind datasets. Model inference for the kernel observer involves three steps: 1) picking the Gaussian RBF kernel k(x, y) = e^{−‖x−y‖²/2σ²}, a search for the ideal σ is performed for a sparse Gaussian process model, with a fixed basis vector set C selected using the method in [3] (for the datasets discussed in this section, the number of basis vectors equals the number of sensing locations in the training set, with the input domain defined over R²); 2) having obtained σ, Gaussian process inference is used to generate weight vectors for each time step in the training set, resulting in the sequence w_τ, τ ∈ {1, ..., T}; 3) matrix least-squares is applied to this sequence to infer Â (Algorithm 3 in the supplementary). For prediction in the autonomous setup, Â is used to propagate the state w_τ forward to make predictions with no feedback; in the observer setup, a Kalman filter (Algorithm 4 in the supplementary), with N determined using Proposition 2 and locations picked randomly, is used to propagate w_τ forward to make predictions. We also compare with a baseline GP (denoted 'original GP'), which is the sparse GP model trained using all of the available data.

Our first dataset, the Intel Berkeley research lab temperature data, consists of 50 wireless temperature sensors in an indoor laboratory region spanning 40.5 meters in length and 31 meters in width.³ Training data consists of temperature data on March 6th, 2004 at intervals of 20 minutes (beginning 00:20 hrs), which totals 72 timesteps. Testing is performed over another 72 timesteps beginning 12:20 hrs of the same day.
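The Â-inference step described above (matrix least-squares over the fitted weight sequence) can be sketched as follows; the toy weight trajectory, its dynamics, and the noise level are our own illustrative assumptions, standing in for the weights produced by GP inference.

```python
import numpy as np

rng = np.random.default_rng(2)
M, T = 8, 100

# stable "true" weight dynamics (illustrative stand-in for the unknown system)
A_true = 0.95 * np.linalg.qr(rng.standard_normal((M, M)))[0]

# weight vectors w_tau, as would be produced by per-time-step GP inference
W = np.empty((T, M))
W[0] = rng.standard_normal(M)
for t in range(T - 1):
    W[t + 1] = A_true @ W[t] + 1e-8 * rng.standard_normal(M)

# matrix least-squares: solve W[:-1] @ A^T = W[1:] for A in the Frobenius sense
A_hat = np.linalg.lstsq(W[:-1], W[1:], rcond=None)[0].T
```

Because the one-step regression is linear in the entries of Â, this fit is a convex problem regardless of the kernel used for the spatial model, which is the hierarchical-separation advantage discussed in Section 1.1.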
Out of 50 locations, we uniformly selected 25 locations each for training and testing purposes. Results of the prediction error are shown in box-plot form in Figure 4a and as a time-series in Figure 4b; note that 'Auto' refers to the autonomous setup. Here, the cyclic index of Â was determined to be 2, so N was set to 2 for the kernel observer with feedback. Note that even the autonomous kernel observer outperforms PCLSK and LEIS overall, and the kernel observer with feedback (N = 2) does so significantly, which is why we did not include results with N > 2.

The second dataset is the Irish wind dataset, consisting of daily average wind speed data collected from 1961 to 1978 at 12 meteorological stations in the Republic of Ireland⁴. The prediction error is shown in box-plot form in Figure 5a and as a time-series in Figure 5b. Again, the cyclic index of Â was determined to be 2. In this case, the autonomous kernel observer's performance is comparable to PCLSK and LEIS, while the kernel observer with feedback (N = 2) again outperforms all other methods. Table ?? in the supplementary reports the total training and prediction times associated with PCLSK, LEIS, and the kernel observer. We observed that 1) the kernel observer is an order of magnitude faster, and 2) even for small sets, competing methods did not scale well.

3.3 Prediction of Global Ocean Surface Temperature

We analyzed the feasibility of our approach on a large dataset from the National Oceanographic Data Center: the 4 km AVHRR Pathfinder project, a satellite record of global ocean surface temperature (Fig. 6a). This dataset is challenging, with measurements at over 37 million possible coordinates but only around 3-4 million measurements available per day, leading to a large amount of missing data. The goal was to learn the day and night temperature models on data from the year 2011, and to monitor thereafter for 2012.
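The observer's feedback loop used in the experiments above can be sketched as a standard Kalman filter over the hidden weight state. The paper's Algorithm 4 is in its supplementary, so everything below (the textbook predict/update form, the sizes, the noise covariances, and the random stand-in measurement map) is an illustrative assumption rather than the paper's implementation.

```python
import numpy as np

# Assumed sketch: hidden weights evolve as w_{t+1} = A w_t and we observe
# y_t = K w_t at N sensing locations; a textbook Kalman filter tracks w_t.

def kalman_step(w, P, y, A, K, Q, R):
    """One predict/update cycle; returns the new state estimate and covariance."""
    w_pred = A @ w
    P_pred = A @ P @ A.T + Q
    S = K @ P_pred @ K.T + R                      # innovation covariance
    G = np.linalg.solve(S, K @ P_pred).T          # Kalman gain P_pred K^T S^{-1}
    w_new = w_pred + G @ (y - K @ w_pred)
    I_GK = np.eye(len(w)) - G @ K
    P_new = I_GK @ P_pred @ I_GK.T + G @ R @ G.T  # Joseph form, numerically robust
    return w_new, P_new

rng = np.random.default_rng(2)
M, N = 6, 3                                       # N < M sensors can suffice if observable
A = np.roll(np.eye(M), 1, axis=0)                 # cyclic-shift dynamics (illustrative)
K = rng.standard_normal((N, M))                   # stand-in measurement map
Q, R = 1e-8 * np.eye(M), 1e-4 * np.eye(N)
w_true = rng.standard_normal(M)
w_est, P = np.zeros(M), np.eye(M)                 # deliberately wrong initial estimate
err0 = np.linalg.norm(w_est - w_true)
for _ in range(200):
    w_true = A @ w_true                           # noiseless simulation of the plant
    y = K @ w_true
    w_est, P = kalman_step(w_est, P, y, A, K, Q, R)
err_final = np.linalg.norm(w_est - w_true)
```

When the pair (K, A) is observable, the estimate converges from an arbitrary initial condition, which is the mechanism that lets the feedback observer track the state from far fewer measurements than the autonomous model would need training data.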
Success in monitoring would demonstrate two things: 1) the modeling process can capture spatiotemporal trends that generalize across years, and 2) the observer framework allows us to infer the state using an order of magnitude fewer measurements than are available. Note that due to the size of the dataset and the high computational requirements of the nonstationary kernel methods, a comparison with them was not pursued. To build the autonomous kernel observer and general kernel observer models, we followed the same procedure outlined in Section 3.2, but with C = {c1, . . . , cM}, cj ∈ R², |C| = 300. The cyclic index of Â was determined to be 250, and hence the Kalman filter for the kernel observer model using N ∈ {250, 500, 1000} at random locations was utilized to track the system state given a random initial condition w0. As a fair baseline, the observers are compared to training a sparse GP model (labeled 'original') on approximately 400,000 measurements per day.

³http://db.csail.mit.edu/labdata/labdata.html
⁴http://lib.stat.cmu.edu/datasets/wind.desc

Figure 3: Kernel observability results. (a) Shaded vs. observability; (b) heuristic vs. random.
Figure 4: Comparison of the kernel observer to the PCLSK and LEIS methods on the Intel dataset. (a) Error (boxplot); (b) error (time-series).
Figure 5: Irish Wind. (a) Error (boxplot); (b) error (time-series).
Figure 6: Performance of the kernel observer over AVHRR satellite 2011-12 data with different numbers of observation locations. (a) AVHRR estimate; (b) error-day (time-series); (c) error-night (time-series); (d) error-day (boxplot); (e) error-night (boxplot); (f) estimation time (day).
Figures 6b and 6c compare the autonomous and feedback approaches with 1,000 samples to the baseline GP; here, it can be seen that the autonomous model does well in the beginning, but then incurs an unacceptable amount of error once the time series moves into 2012, i.e., where the model has not seen any training data, whereas KO does well throughout. Figures 6d and 6e show a comparison of the RMS error of the estimated values against the real data. These figures show the trend of the observer obtaining better state estimates as a function of the number of sensing locations N⁵. Finally, the prediction time of KO is much smaller than that of retraining the model at every time step, as shown in Figure 6f.

4 Conclusions

This paper presented a new approach to the problem of monitoring complex spatiotemporally evolving phenomena with limited sensors. Unlike most neural network or kernel based models, the presented approach inherently incorporates differential constraints on the spatiotemporal evolution of the mixing weights of a kernel model. In addition to providing an elegant and efficient model, the main benefit of including the differential constraint in the model synthesis is that it allowed the derivation of fundamental results concerning the minimum number of sampling locations required, and the identification of correlations in the spatiotemporal evolution, by building upon the rich literature in systems theory. These results are non-conservative, and as such provide direct guidance in ensuring robust real-world predictive inference with distributed sensor networks.

Acknowledgment

This work was supported by AFOSR grant #FA9550-15-1-0146.

⁵Note that we checked the performance of training a GP with only 1,000 samples as a control, but the average error was about 10 Kelvins, i.e.
much worse than KO.

References

[1] Haim Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer Science & Business Media, 2010.

[2] Noel Cressie and Christopher K. Wikle. Statistics for Spatio-Temporal Data. John Wiley & Sons, 2011.

[3] Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.

[4] Sahil Garg, Amarjeet Singh, and Fabio Ramos. Learning non-stationary space-time models for environmental monitoring. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada, 2012.

[5] David Higdon. A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environmental and Ecological Statistics, 5(2):173–190, 1998.

[6] Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Harandi. Kernel methods on Riemannian manifolds with Gaussian RBF kernels.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.

[7] Chunsheng Ma. Nonstationary covariance functions that model space–time interactions. Statistics & Probability Letters, 61(4):411–419, 2003.

[8] Kanti V. Mardia, Colin Goodall, Edwin J. Redfern, and Francisco J. Alonso. The kriged Kalman filter. Test, 7(2):217–282, 1998.

[9] C. Paciorek and M. Schervish. Nonstationary covariance functions for Gaussian process regression. Advances in Neural Information Processing Systems, 16:273–280, 2004.

[10] Fernando Pérez-Cruz, Steven Van Vaerenbergh, Juan José Murillo-Fuentes, Miguel Lázaro-Gredilla, and Ignacio Santamaria. Gaussian processes for nonlinear signal processing: An overview of recent advances. IEEE Signal Processing Magazine, 30(4):40–50, 2013.

[11] Tobias Pfingsten, Malte Kuss, and Carl Edward Rasmussen. Nonstationary Gaussian process regression using a latent extension of the input space, 2006.

[12] Christian Plagemann, Kristian Kersting, and Wolfram Burgard. Nonstationary Gaussian process regression using point estimates of local smoothness. In Machine Learning and Knowledge Discovery in Databases, pages 204–219. Springer, 2008.

[13] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.

[14] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, December 2005.

[15] Alexandra M. Schmidt and Anthony O'Hagan. Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):743–758, 2003.

[16] Amarjeet Singh, Fabio Ramos, H. Durrant-Whyte, and William J. Kaiser. Modeling and decision making in spatio-temporal processes for environmental surveillance.
In 2010 IEEE International Conference on Robotics and Automation (ICRA), pages 5490–5497. IEEE, 2010.

[17] Christopher K. Wikle. A kernel-based spectral model for non-Gaussian spatio-temporal processes. Statistical Modelling, 2(4):299–314, 2002.

[18] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688, 2001.

[19] W. Murray Wonham. Linear Multivariable Control. Springer, 1974.

[20] Kemin Zhou, John C. Doyle, and Keith Glover. Robust and Optimal Control. Prentice Hall, Upper Saddle River, NJ, 1996.