{"title": "Solving inverse problem of Markov chain with partial observations", "book": "Advances in Neural Information Processing Systems", "page_first": 1655, "page_last": 1663, "abstract": "The Markov chain is a convenient tool to represent the dynamics of complex systems such as traffic and social systems, where probabilistic transition takes place between internal states. A Markov chain is characterized by initial-state probabilities and a state-transition probability matrix. In the traditional setting, a major goal is to figure out properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on every single link on a road network from a very limited number of observation points. We formulate this task as a regularized optimization problem for probability functions, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets including city traffic monitoring data, we demonstrate the effectiveness of our method.", "full_text": "Solving inverse problem of Markov chain\n\nwith partial observations\n\nTetsuro Morimura\nIBM Research - Tokyo\n\ntetsuro@jp.ibm.com\n\nTakayuki Osogami\nIBM Research - Tokyo\n\nosogami@jp.ibm.com\n\nTsuyoshi Id\u00b4e\n\nIBM T.J. Watson Research Center\n\ntide@us.ibm.com\n\nAbstract\n\nThe Markov chain is a convenient tool to represent the dynamics of complex sys-\ntems such as traf\ufb01c and social systems, where probabilistic transition takes place\nbetween internal states. A Markov chain is characterized by initial-state proba-\nbilities and a state-transition probability matrix. 
In the traditional setting, a major goal is to study the properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on every single link on a road network from a limited number of observation points. We formulate this task as a regularized optimization problem, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets, including city traffic monitoring data, we demonstrate the effectiveness of our method.

1 Introduction

The Markov chain is a standard model for analyzing the dynamics of stochastic systems, including economic systems [29], traffic systems [11], social systems [12], and ecosystems [6]. There is a large body of literature on the problem of analyzing the properties of a Markov chain given its initial distribution and a matrix of transition probabilities [21, 26]. For example, there exist established methods for analyzing the stationary distribution and the mixing time of a Markov chain [23, 16]. In these traditional settings, the initial distribution and the transition-probability matrix are given a priori or directly estimated.

Unfortunately, it is often impractical to directly measure or estimate the parameters (i.e., the initial distribution and the transition-probability matrix) of the Markov chain that models a particular system under consideration. 
For example, one can analyze a traffic system [27, 24], including how the vehicles are distributed across a city, by modeling the dynamics of vehicles as a Markov chain [11]. It is, however, difficult to directly measure the fraction of the vehicles that turn right or left at every intersection.

The inverse problem of a Markov chain that we address in this paper is an inverse version of the traditional problem of analyzing a Markov chain with given input parameters. Namely, our goal is to estimate the parameters of a Markov chain from partial observations of the corresponding system. In the context of the traffic system, for example, we seek to find the parameters of a Markov chain, given the traffic volumes at stationary observation points and/or the rate of vehicles moving between these points. Such statistics can be reliably estimated from observations with web-cameras [27], automatic number plate recognition devices [10], or radio-frequency identification (RFID) [25], whose availability is, however, generally limited to a small number of observation points (see Figure 1). By estimating the parameters of a Markov chain and analyzing its stationary probability, one can infer the traffic volumes at unobserved points.

Figure 1: An inverse Markov chain problem. The traffic volume on every road is inferred from traffic volumes at limited observation points and/or the rates of vehicles transitioning between these points.

The primary contribution of this paper is the first methodology for solving the inverse problem of a Markov chain when only the observations at a limited number of stationary observation points are given. Specifically, we assume that the frequency of visiting a state and/or the rate of reaching a state from another are given for a small number of states. We formulate the inverse problem of a Markov chain as a regularized optimization problem. 
Then we can efficiently find a solution to the inverse problem of a Markov chain based on the notion of natural gradient [3].

The inverse problem of a Markov chain has been addressed in the literature [9, 28, 31], but the existing methods assume that sample paths of the Markov chain are available. Related work on inverse reinforcement learning [20, 1, 32] also assumes that sample paths are available. In the context of the traffic system, the sample paths correspond to probe-car data (i.e., sequences of GPS points). However, probe-car data is expensive and rarely available in public. Even when it is available, it is often limited to vehicles of a particular type, such as taxis, or to a particular region. On the other hand, stationary observation data is often less expensive and more obtainable. For instance, web-camera images are available even in developing countries such as Kenya [2].

The rest of this paper is organized as follows. In Section 2, preliminaries are introduced. In Section 3, we formulate an inverse problem of a Markov chain as a regularized optimization problem. A method for efficiently solving the inverse problem of a Markov chain is proposed in Section 4. An example of implementation is provided in Section 5. Section 6 evaluates the proposed method with both artificial and real-world data sets, including one from traffic monitoring in a city.

2 Preliminaries

A discrete-time Markov chain [26, 21] is a stochastic process, X = (X_0, X_1, ...), where X_t is a random variable representing the state at time t ∈ Z≥0. A Markov chain is defined by the triplet {X, p_I, p_T}, where X = {1, ..., |X|} is a finite set of states and |X| ≥ 2 is the number of states. The function p_I : X → [0, 1] specifies the initial-state probability, i.e., p_I(x) ≜ Pr(X_0 = x), and p_T : X × X → 
[0, 1] specifies the state-transition probability from x to x′, i.e., p_T(x′ | x) ≜ Pr(X_{t+1} = x′ | X_t = x), ∀t ∈ Z≥0. Note that the state transition is conditionally independent of the past states given the current state, which is called the Markov property.

Any Markov chain can be converted into another Markov chain, called a Markov chain with restart, by modifying the transition probability. There, the initial-state probability stays unchanged, but the state-transition probability is modified into p such that

p(x′ | x) ≜ β p_T(x′ | x) + (1 − β) p_I(x′),    (1)

where β ∈ [0, 1) is a continuation rate of the Markov chain.¹ In the limit of β → 1, this Markov chain with restart is equivalent to the original Markov chain. In the following, we refer to p as the (total) transition probability, and to p_T as a partial transition (or p-transition) probability.

¹The rate β can depend on the current state x, so that β can be replaced with β(x) throughout the paper. For readability, we assume β is a constant.

Our main targeted applications are (massive) multi-agent systems such as traffic systems. 
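In matrix form, the restart construction of Eq. (1) mixes every row of the p-transition matrix with the initial distribution. A minimal numerical sketch (the 3-state values below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical 3-state chain.
P_T = np.array([[0.0, 0.5, 0.5],   # partial (p-)transition probabilities p_T(x' | x)
                [1.0, 0.0, 0.0],
                [0.3, 0.7, 0.0]])
p_I = np.array([0.2, 0.3, 0.5])    # initial-state probabilities p_I(x')
beta = 0.9                         # continuation rate

# Eq. (1): p(x' | x) = beta * p_T(x' | x) + (1 - beta) * p_I(x').
# Broadcasting adds the same restart row (1 - beta) * p_I to every row.
P = beta * P_T + (1.0 - beta) * p_I[np.newaxis, :]
```

With β close to 1 the chain behaves like the original one; with β = 0 every step is a restart drawn from p_I.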
So, restarting a chain means that an agent's origin of a trip is decided by the initial distribution, and the trip ends at each time step with probability 1 − β.

We model the initial probability and p-transition probability with parameters ν ∈ R^{d1} and ω ∈ R^{d2}, respectively, where d1 and d2 are the numbers of those parameters. So we will denote them as p_Iν and p_Tω, respectively, and the total transition probability as p_θ, where θ is the total model parameter, θ ≜ [ν^⊤, ω^⊤, β̃]^⊤ ∈ R^d, where d = d1 + d2 + 1 and β̃ ≜ ς^{-1}(β) with the inverse of the sigmoid function, ς^{-1}. That is, Eq. (1) is rewritten as

p_θ(x′ | x) ≜ β p_Tω(x′ | x) + (1 − β) p_Iν(x′).    (2)

The Markov chain with restart can be represented as M(θ) ≜ {X, p_Iν, p_Tω, β}.

We also make the following assumptions, which are standard for the study of Markov chains and their variants [26, 7].

Assumption 1 The Markov chain M(θ) for any θ ∈ R^d is ergodic (irreducible and aperiodic).

Assumption 2 The initial probability p_Iν and the p-transition probability p_Tω are differentiable everywhere with respect to θ ∈ R^d.²

Under Assumption 1, there exists a unique stationary probability, π_θ(·), which satisfies the balance equation:

π_θ(x′) = Σ_{x∈X} p(x′ | x) π_θ(x),  ∀x′ ∈ X.    (3)

This stationary probability is equal to the limiting distribution and independent of the initial state: π_θ(x′) = lim_{t→∞} Pr(X_t = x′ | X_0 = x, M(θ)), ∀x ∈ X. Assumption 2 indicates that the transition probability p_θ is also differentiable for any state pair (x, x′) ∈ X × X with respect to any θ ∈ R^d.

Finally, we define hitting probabilities for a Markov chain of indefinite horizon. 
The Markov chain is represented as M̃(θ) = {X, p_Tω, β}, which evolves according to the p-transition probability p_Tω, not to p_θ, and terminates with probability 1 − β at every step. The hitting probability of a state x′ given x is defined as

h_θ(x′ | x) ≜ Pr(x′ ∈ X̃ | X_0 = x, M̃(θ)),    (4)

where X̃ = (X̃_0, ..., X̃_T) is a sample path of M̃(θ) until the stopping time, T.

3 Inverse Markov Chain Problem

Here we formulate an inverse problem of the Markov chain M(θ). In the inverse problem, the model family M ∈ {M(θ) | θ ∈ R^d}, which may be subject to a transition structure as in the road network, is known or given a priori, but the model parameter θ is unknown. In Section 3.1, we define the inputs of the problem, which are associated with functions of the Markov chain. Objective functions for the inverse problem are discussed in Section 3.2.

3.1 Problem setting

The input and output of our inverse problem of the Markov chain are as follows.

• Inputs are the values measured at a portion of the states x ∈ X_o, where X_o ⊂ X and usually |X_o| ≪ |X|. The measured values include the frequency of visiting a state, f(x), x ∈ X_o. In addition, the rate of reaching a state from another, g(x, x′), might also be given for (x, x′) ∈ X_o × X_o, where g(x, x) is equal to 1. In the context of traffic monitoring, f(x) denotes the number of vehicles that went through an observation point, x; g(x, x′) denotes the number of vehicles that went through x and x′ in this order, divided by f(x).

• Output is the estimated parameter θ of the Markov chain M(θ), which specifies the total-transition probability function p_θ in Eq. 
(2).

²We assume ∂/∂θ_i log p_Iν(x) = 0 when p_Iν(x) = 0, and an analogous assumption applies to p_Tω.

The first step of our formulation is to relate f and g to the Markov chain. Specifically, we assume that the observed f is proportional to the true stationary probability of the Markov chain:

π*(x) = c f(x),  x ∈ X_o,    (5)

where c is an unknown constant to satisfy the normalization condition. We further assume that the observed reaching rate is equal to the true hitting probability of the Markov chain:

h*(x′ | x) = g(x, x′),  (x, x′) ∈ X_o × X_o.    (6)

3.2 Objective function

Our objective is to find the parameter θ* such that π_θ* and h_θ* well approximate π* and h* in Eqs. (5) and (6). We use the following objective function to be minimized:

L(θ) ≜ γ L_d(θ) + (1 − γ) L_h(θ) + λ R(θ),    (7)

where L_d and L_h are cost functions with respect to the quality of the approximation of π* and h*, respectively. These are specified in the following subsections. The function R(θ) is the regularization term of θ, such as ‖θ‖²₂ or ‖θ‖₁. The parameters γ ∈ [0, 1] and λ ≥ 0 balance the cost functions and the regularization term, and will be optimized by cross-validation. Altogether, our problem is to find the parameter θ* = argmin_{θ∈R^d} L(θ).

3.2.1 Cost function for stationary probability function

Because the constant c in Eq. (5) is unknown, we cannot, for example, minimize a squared error such as Σ_{x∈X_o} (π*(x) − π_θ(x))². Thus, we need to derive an alternative cost function of π_θ that is independent of c.

For L_d(θ), one natural choice might be a Kullback-Leibler (KL) divergence,

L_d^KL(θ) ≜ Σ_{x∈X_o} π*(x) log (π*(x) / π_θ(x)) = −c Σ_{x∈X_o} f(x) log π_θ(x) + o,

where o is a term independent of θ. The minimizer of L_d^KL(θ) is independent of c. However, minimization of L_d^KL will lead to a biased estimate. This is because L_d^KL will be decreased by increasing Σ_{x∈X_o} π_θ(x) while the ratios π_θ(x)/π_θ(x′), ∀x, x′ ∈ X_o, are unchanged. This implies that, because Σ_{x∈X_o} π_θ(x) + Σ_{x∈X\X_o} π_θ(x) = 1, minimizing L_d^KL has the unwanted side-effect of overvaluing Σ_{x∈X_o} π_θ(x) and undervaluing Σ_{x∈X\X_o} π_θ(x).

Here we propose an alternative form of L_d that avoids this side-effect. It uses a logarithmic ratio of the stationary probabilities such that

L_d(θ) ≜ (1/2) Σ_{i∈X_o} Σ_{j∈X_o} ( log (π*(i)/π*(j)) − log (π_θ(i)/π_θ(j)) )²
       = (1/2) Σ_{i∈X_o} Σ_{j∈X_o} ( log (f(i)/f(j)) − log (π_θ(i)/π_θ(j)) )².    (8)

The log-ratio of probabilities represents the difference of information contents between these probabilities in the sense of information theory [17]. Thus this function can be regarded as a sum of squared errors between π*(x) and π_θ(x) over x ∈ X_o with respect to relative information contents. From a different point of view, Eq. 
(8) follows from maximizing the likelihood of θ under the assumption that the observation "log f(i) − log f(j)" has Gaussian white noise N(0, ε²). This assumption is satisfied when f(i) has a log-normal distribution, LN(μ_i, (ε/√2)²), independently for each i, where μ_i is the true location parameter and the median of f(i) is equal to e^{μ_i}.

3.2.2 Cost function for hitting probability function

Unlike L_d(θ), there are several options for L_h(θ). Examples of this cost function include the mean squared error and the mean absolute error. Here we use the following standard squared errors in the log space, based on Eq. (6):

L_h(θ) ≜ (1/2) Σ_{i∈X_o} Σ_{j∈X_o} ( log g(i, j) − log h_θ(j | i) )².    (9)

Eq. (9) follows from maximizing the likelihood of θ under the assumption that the observation log g(i, j) has Gaussian white noise, as in the case of L_d(θ).

4 Gradient-based Approach

Let us consider (local) minimization of the objective function L(θ) in Eq. (7). We adopt a gradient-descent approach for the problem, where the parameter θ is optimized by the following iteration, with the notation ∇_θ L(θ) ≜ [∂L(θ)/∂θ_1, ..., ∂L(θ)/∂θ_d]^⊤:

θ_{t+1} = θ_t − η_t G_{θ_t}^{-1} { γ ∇_θ L_d(θ_t) + (1 − γ) ∇_θ L_h(θ_t) + λ ∇_θ R(θ_t) },    (10)

where η_t > 0 is an updating rate. The matrix G_{θ_t} ∈ R^{d×d}, called the metric of the parameter θ, is an arbitrary bounded positive definite matrix. When G_{θ_t} is set to the identity matrix of size d, I_d, the update formula in Eq. (10) becomes an ordinary gradient descent. 
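One iteration of Eq. (10) is a metric-preconditioned gradient step. A minimal sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def update_theta(theta, grad_Ld, grad_Lh, grad_R, G, gamma, lam, eta):
    """One iteration of Eq. (10): theta <- theta - eta * G^{-1} * total gradient."""
    total_grad = gamma * grad_Ld + (1.0 - gamma) * grad_Lh + lam * grad_R
    # Solve G x = total_grad instead of forming G^{-1} explicitly.
    return theta - eta * np.linalg.solve(G, total_grad)

# Toy check: with G = I_d the step reduces to ordinary gradient descent.
theta0 = np.zeros(2)
g = np.array([1.0, -2.0])
theta1 = update_theta(theta0, g, g, np.zeros(2), np.eye(2),
                      gamma=0.5, lam=0.0, eta=1.0)
```

Solving the linear system rather than inverting G is the standard numerically stable way to apply the preconditioner.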
However, since the tangent space at a point of a manifold representing M(θ) is generally different from an orthonormal space with respect to θ [4], one can apply the idea of natural gradient [3] to the metric G_θ, expecting to make the procedure more efficient. This is described in Section 4.1.

The gradients of L_d and L_h in Eq. (10) are given as

∇_θ L_d(θ) = Σ_{i∈X_o} Σ_{j∈X_o} ( log (f(i)/f(j)) − log (π_θ(i)/π_θ(j)) ) ( ∇_θ log π_θ(j) − ∇_θ log π_θ(i) ),

∇_θ L_h(θ) = −Σ_{i∈X_o} Σ_{j∈X_o} ( log g(i, j) − log h_θ(j | i) ) ∇_θ log h_θ(j | i).

In order to implement the update rule of Eq. (10), we need to compute the gradient of the logarithmic stationary probability ∇_θ log π_θ, the hitting probability h_θ, and its gradient ∇_θ h_θ. In Section 4.2, we will describe how to compute them, which will turn out to be quite non-trivial.

4.1 Natural gradient

Usually, a parametric family of Markov chains, M_θ ≜ {M(θ) | θ ∈ R^d}, forms a manifold structure with respect to the parameter θ under information divergences such as the KL divergence, instead of the Euclidean structure. Thus the ordinary gradient, Eq. (10) with G_θ = I_d, does not properly reflect the differences in the sensitivities and the correlations between the elements of θ. Accordingly, the ordinary gradient is generally different from the steepest direction on the manifold, and the optimization process with the ordinary gradient often becomes unstable or falls into a learning plateau [5].

For efficient learning, we consider an appropriate G_θ based on the notion of the natural gradient (NG) [5]. 
The NG represents the steepest descent direction of a function b(θ) in a Riemannian space³ by −R_θ^{-1} ∇_θ b(θ) when the Riemannian space is defined by the metric matrix R_θ. An appropriate Riemannian metric on a statistical model, Y, having parameters, θ, is known to be its Fisher information matrix (FIM):⁴

Σ_y Pr(Y = y | θ) ∇_θ log Pr(Y = y | θ) ∇_θ log Pr(Y = y | θ)^⊤.

In our case, the joint probability, p_θ(x′ | x) π_θ(x) for x, x′ ∈ X, fully specifies M(θ) at the steady state, due to the Markovian property. Thus we propose to use the following G_θ in the update rule of Eq. (10):

G_θ = F_θ + σ I_d,    (11)

where F_θ is the FIM of p_θ(x′ | x) π_θ(x),

F_θ ≜ Σ_{x∈X} π_θ(x) ( ∇_θ log π_θ(x) ∇_θ log π_θ(x)^⊤ + Σ_{x′∈X} p_θ(x′ | x) ∇_θ log p_θ(x′ | x) ∇_θ log p_θ(x′ | x)^⊤ ).

The second term with σ ≥ 0 in Eq. (11) will be needed to make G_θ positive definite.

³A parameter space is a Riemannian space if the parameter θ ∈ R^d is on a Riemannian manifold defined by a positive definite matrix called a Riemannian metric matrix, R_θ ∈ R^{d×d}. The squared length of a small incremental vector Δθ connecting θ to θ + Δθ in a Riemannian space is given by ‖Δθ‖²_{R_θ} = Δθ^⊤ R_θ Δθ.

⁴The FIM is the unique metric matrix of the second-order Taylor expansion of the KL divergence, that is, Σ_y Pr(Y = y | θ) log [Pr(Y = y | θ) / Pr(Y = y | θ + Δθ)] ≃ (1/2) ‖Δθ‖²_{F_θ}.

4.2 Computing the gradient

To derive an expression for computing ∇_θ log π_θ, we use the following notations for a vector and a matrix: π_θ ≜ [π_θ(1), ..., π_θ(|X|)]^⊤ and (P_θ)_{x,x′} ≜ p_θ(x′ | x). Then the gradient of the logarithmic stationary probability with respect to θ_i is given by

∂/∂θ_i log π_θ ≜ ∇_{θ_i} log π_θ = Diag(π_θ)^{-1} (I_{|X|} − P_θ^⊤ + π_θ 1^⊤)^{-1} (∇_{θ_i} P_θ^⊤) π_θ,    (12)

where Diag(a) is a diagonal matrix whose diagonal elements consist of a vector a, log a is the element-wise logarithm of a, and 1 denotes a column vector of size |X| whose elements are all 1.

In the remainder of this section, we prove Eq. (12) by using the following proposition.

Proposition 1 ([7]) If A ∈ R^{d×d} satisfies lim_{K→∞} A^K = 0, then the inverse of (I − A) exists, and (I − A)^{-1} = lim_{K→∞} Σ_{k=0}^K A^k.

Equation (3) is rewritten as π_θ = P_θ^⊤ π_θ. Note that π_θ is equal to a normalized eigenvector of P_θ^⊤ whose eigenvalue is 1. By taking a partial differential of Eq. (3) with respect to θ_i, Diag(π_θ) ∇_{θ_i} log π_θ = (∇_{θ_i} P_θ^⊤) π_θ + P_θ^⊤ Diag(π_θ) ∇_{θ_i} log π_θ is obtained. Though we get the following linear simultaneous equation for ∇_{θ_i} log π_θ,

(I_{|X|} − P_θ^⊤) Diag(π_θ) ∇_{θ_i} log π_θ = (∇_{θ_i} P_θ^⊤) π_θ,    (13)

the inverse of (I_{|X|} − P_θ^⊤) Diag(π_θ) does not exist. This comes from the fact that (I_{|X|} − P_θ^⊤) Diag(π_θ) 1 = 0. So we add a term including 1^⊤ Diag(π_θ) ∇_{θ_i} log π_θ = 1^⊤ ∇_{θ_i} π_θ = ∇_{θ_i} {1^⊤ π_θ} = 0 to Eq. (13), such that (I_{|X|} − P_θ^⊤ + π_θ 1^⊤) Diag(π_θ) ∇_{θ_i} log π_θ = (∇_{θ_i} P_θ^⊤) π_θ. The inverse of (I_{|X|} − P_θ^⊤ + π_θ 1^⊤) exists, because of Proposition 1 and the fact that lim_{k→∞} (P_θ^⊤ − π_θ 1^⊤)^k = lim_{k→∞} P_θ^{⊤k} − π_θ 1^⊤ = 0. The inverse of Diag(π_θ) also exists, because π_θ(x) is positive for any x ∈ X under Assumption 1. Hence we get Eq. (12).

To derive expressions for computing h_θ and ∇_θ log h_θ, we use the following notations: h_θ(x) ≜ [h_θ(x | 1), ..., h_θ(x | |X|)]^⊤ for the hitting probabilities in Eq. (4), and (P_{Tθ})_{x,x′} ≜ p_{Tω}(x′ | x) for the p-transition probabilities in Eq. (1). 
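Eq. (12) reduces the gradient of the log-stationary distribution to two linear-algebra steps: an eigenvector computation for π_θ and one linear solve. A minimal numerical sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def grad_log_stationary(P, dP):
    """Gradient of log pi w.r.t. one parameter theta_i, following Eq. (12).

    P  : |X| x |X| total transition matrix (rows sum to 1)
    dP : element-wise derivative of P w.r.t. theta_i
    """
    n = P.shape[0]
    # Stationary distribution: normalized left eigenvector of P for eigenvalue 1.
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()
    # Eq. (12): Diag(pi)^{-1} (I - P^T + pi 1^T)^{-1} (dP^T) pi.
    A = np.eye(n) - P.T + np.outer(pi, np.ones(n))
    return np.linalg.solve(A, dP.T @ pi) / pi
```

For a two-state chain P = [[1−a, a], [b, 1−b]], whose stationary distribution is (b, a)/(a+b), the result agrees with the analytic derivative of log π with respect to a.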
The hitting probabilities and their gradients with respect to θ_i can be computed in the following closed forms:

h_θ(x) = (I_{|X|} − β P_{Tθ}^{nx})^{-1} e_{|X|}^x,    (14)

∇_{θ_i} log h_θ(x) = β Diag(h_θ(x))^{-1} (I_{|X|} − β P_{Tθ}^{nx})^{-1} (∇_{θ_i} P_{Tθ}^{nx}) h_θ(x),    (15)

where e_{|X|}^x denotes a column vector of size |X| whose x'th element is 1 and all of the other elements are zero. The matrix P_{Tθ}^{nx} is defined as (I_{|X|} − e_{|X|}^x e_{|X|}^{x⊤}) P_{Tθ}. We derive Eqs. (14) and (15) as follows. The hitting probabilities in Eq. (4) can be represented in the following recursive form:

h_θ(x′ | x) = 1 if x′ = x, and h_θ(x′ | x) = β Σ_{y∈X} p_{Tω}(y | x) h_θ(x′ | y) otherwise.

This equation can be represented with the matrix notation as h_θ(x) = e_{|X|}^x + β P_{Tθ}^{nx} h_θ(x). Because the inverse of (I_{|X|} − β P_{Tθ}^{nx}) exists by Proposition 1 and lim_{k→∞} (β P_{Tθ}^{nx})^k = 0, we get Eq. (14). In a similar way, one can prove Eq. (15).

5 Implementation

For implementing the proposed method, parametric models of the initial probability p_Iν and the p-transition probability p_Tω in Eq. (1) need to be specified. We provide intuitive models based on the logit function [8].

The initial probability is modeled as

p_Iν(x) ≜ exp(s_I(x, ν)) / Σ_{y∈X} exp(s_I(y, ν)),    (16)

where s_I(x, ν) is a state score function with its parameter ν ≜ [ν^loc⊤, ν^glo⊤]^⊤ ∈ R^{d1}, consisting of a local parameter ν^loc ∈ R^{|X|} and a global parameter ν^glo ∈ R^{d1−|X|}. 
It is defined as

s_I(x, ν) ≜ ν^loc_x + φ_I(x)^⊤ ν^glo,    (17)

where φ_I(x) ∈ R^{d1−|X|} is a feature vector of a state x. In the case of the road network, a state corresponds to a road segment. Then φ_I(x) may, for example [18], be defined with indicators of whether there are particular types of buildings near the road segment, x. We refer to the first term and the second term on the right-hand side of Eq. (17) as a local preference and a global preference, respectively. If a simpler model is preferred, either of them would be omitted.

Similarly, a p-transition probability model with the parameter ω ≜ [ω^loc⊤, ω^glo⊤_1, ω^glo⊤_2]^⊤ is given as

p_Tω(x′ | x) ≜ exp(s_T(x, x′, ω)) / Σ_{y∈X_x} exp(s_T(x, y, ω)) if (x, x′) ∈ X × X_x, and 0 otherwise,    (18)

where X_x is the set of states connected from x, and s_T(x, x′, ω) is a state-to-state score function. It is defined as

s_T(x, x′, ω) ≜ ω^loc_{(x,x′)} + φ_T(x′)^⊤ ω^glo_1 + ψ(x, x′)^⊤ ω^glo_2,

where ω^loc_{(x,x′)} is the element of ω^loc (∈ R^{Σ_{x∈X} |X_x|}) corresponding to the transition from x to x′, and φ_T(x) and ψ(x, x′) are feature vectors. For the road network, φ_T(x) may be defined based on the type of the road segment, x, and ψ(x, x′) may be defined based on the angle between x and x′. Their linear combinations with the global parameters, ω^glo_1 and ω^glo_2, can represent drivers' preferences, such as how much the drivers prefer major roads or straight routes to others.

Note that the p_Iν(x) and p_Tω(x′ | x) presented in this section can be differentiated analytically. Hence, F_θ in Eq. (11), ∇_{θ_i} log π_θ in Eq. 
(12), and ∇_{θ_i} h_θ in Eq. (15) can be computed efficiently.

6 Experiments

6.1 Experiment on synthetic data

To study the sensitivity of the performance of our algorithm to the ratio of observable states, we applied it to randomly synthesized inverse problems of 100-state Markov chains with a varying number of observable states, |X_o| ∈ {5, 10, 20, 35, 50, 70, 90}. The linkages between states were randomly generated in the same way as in [19]. The values of p_I and p_T were determined in two stages. First, the basic initial probabilities, p_Iν, and the basic transition probabilities, p_Tω, were determined based on Eqs. (16) and (18), where every element of ν, ω, φ_I(x), φ_T(x), and ψ(x, x′) was drawn independently from the normal distribution N(0, 1²). Then we added noise to p_Iν and p_Tω, which would otherwise be ideal for our algorithm, by using the Dirichlet distribution Dir, such that p_I = 0.7 p_Iν + 0.3σ with σ ∼ Dir(0.3 × 1_{|X|}). Then we sampled the visiting frequencies f(x) and the hitting rates g(x, x′) for every x, x′ ∈ X_o from this synthesized Markov chain.

We used Eqs. (16) and (18) for the models and Eq. (7) for the objective of our method. In Eq. (7), we set γ = 0.1 and R(θ) = ‖θ‖²₂, and λ was determined with cross-validation. We evaluated the quality of our solution with the relative mean absolute error (RMAE),

RMAE = (1/|X\X_o|) Σ_{x∈X\X_o} |f(x) − ĉ π_θ(x)| / max{f(x), 1},

where ĉ is a scaling value given by ĉ = (1/|X_o|) Σ_{x∈X_o} f(x). As a baseline method, we use Nadaraya-Watson kernel regression (NWKR) [8], whose kernel is computed based on the number of hops in the minimum path between two states. Note that the NWKR could not use g(x, x′) as an input, because this is a regression problem of f(x). 
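The RMAE above can be computed directly from its definition. A minimal sketch, with the scaling constant passed in explicitly since its estimator may vary (function and variable names are our own):

```python
import numpy as np

def rmae(f_true, pi_est, c_hat):
    """Relative mean absolute error over unobserved states (Section 6.1).

    f_true : true visiting frequencies at the unobserved states
    pi_est : estimated stationary probabilities at those states
    c_hat  : scaling value mapping probabilities to frequencies
    """
    f_true = np.asarray(f_true, dtype=float)
    pi_est = np.asarray(pi_est, dtype=float)
    return np.mean(np.abs(f_true - c_hat * pi_est) / np.maximum(f_true, 1.0))

# Toy example: a perfectly scaled estimate yields an RMAE of (numerically) zero.
err = rmae([10.0, 4.0], [0.5, 0.2], c_hat=20.0)
```

The max{f(x), 1} denominator keeps the relative error well-defined on links with near-zero traffic.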
Hence, for a fair comparison, we also applied a variant of our method that does not use g(x, x′).

Figure 2 (A) shows the mean and standard deviation of the RMAEs. The proposed method gives clearly better performance than the NWKR. This is mainly due to the fact that the NWKR assumes that all propagations of the observation from a link to another connected link are equally weighted. In contrast, our method incorporates such weights in the transition probabilities.

Figure 2: (A) Comparison of RMAE for the synthetic task between our methods and the NWKR (baseline method). (B) Traffic volumes for a city center map in Nairobi, Kenya. I: Web-camera observations (colored). II: Estimated traffic volumes by our method. (C) Comparison between the NWKR and our method for the real traffic-volume prediction problem.

6.2 Experiment on real-world traffic data

We tested our method through a city-wide traffic-monitoring task, as shown in Fig. 1. The goal is to estimate the traffic volume along an arbitrary road segment (or link of a network), given observed traffic volumes on a limited number of the links, where a link corresponds to the state x of M(θ), and the traffic volume along x corresponds to f(x) of Eq. (5). The traffic volumes along the observable links were reliably estimated from real-world web-camera images captured in Nairobi, Kenya [2, 15], while we did not use the hitting rate g(x, x′) here because of its unavailability. Note that this task is similar to network tomography [27, 30] or link-cost prediction [32, 14]. However, unlike network tomography, we need to infer all of the link traffics instead of source-destination demands. 
Unlike link-cost prediction, our inputs are stationary observations instead of trajectories. Again, we use the NWKR as the baseline method. The road network and the web-camera observations are shown in Fig. 2 (B)-I. While the total number of links was 1,497, the number of links with observations was only 52 (about 3.5%). We used the parametric models in Section 5, where φ_T(x) ∈ [−1, 1] was set based on the road category of x such that primary roads have a higher value than secondary roads [22], and ψ_T(x, x′) ∈ [−1, 1] was the cosine of the angle between x and x′. However, we omitted the terms of φ_I(x) in Eq. (17).

Figure 2 (B)-II shows an example of our results, where the red and yellow roads are most congested while the traffic on the blue roads is flowing smoothly. The congested roads from our analysis are consistent with those from a local traffic survey report [13]. Figure 2 (C) shows a comparison between predicted and observed traffic volumes. In the figures, the 45° line corresponds to perfect agreement between the actual and predicted values. To evaluate accuracy, we employed leave-one-out cross-validation. We can see that the proposed method gives good performance. This is rather surprising, because the rate of observation links is limited to only 3.5 percent.

7 Conclusion

We have defined a novel inverse problem of a Markov chain, where we infer the probabilities of the initial states and the transitions, using a limited amount of information that we can obtain by observing the Markov chain at a small number of states. We have proposed an effective objective function for this problem as well as an algorithm based on natural gradient.

Using real-world data, we have demonstrated that our approach is useful for a traffic monitoring system that monitors the traffic volume at a limited number of locations.
From these observations, the Markov chain model is inferred, which in turn can be used to deduce the traffic volume at any location. Surprisingly, even when the observations are made at only a few percent of the locations, the proposed method can successfully infer the traffic volume at unobserved locations.

Further analysis of the proposed method is necessary to better understand its properties and effectiveness. In particular, our future work includes an analysis of model identifiability and empirical studies with other applications, such as logistics and economic system modeling.

Acknowledgments
The authors thank Dr. R. Morris, Dr. R. Raymond, and Mr. T. Katsuki for fruitful discussion.

[Figure 2 plots omitted: panel (A) plots RMAE against the number of observation states for the proposed method, its variant without g, and the NWKR; panel (C) reports RMAE 1.01 ± 0.917 for the NWKR and 0.517 ± 0.669 for the proposed method.]

References
[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. of International Conference on Machine Learning, 2004.
[2] AccessKenya.com. http://traffic.accesskenya.com/.
[3] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[4] S. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
[5] S. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.
[6] H. Balzter. Markov chain models for vegetation dynamics. Ecological Modelling, 126(2-3):139–154, 2000.
[7] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[8] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[9] H. H. Bui, S. Venkatesh, and G. West. On the recognition of abstract Markov policies. In Proc. of AAAI Conference on Artificial Intelligence, pages 524–530, 2000.
[10] S. L. Chang, L. S. Chen, Y. C. Chung, and S. W. Chen. Automatic license plate recognition. IEEE Transactions on Intelligent Transportation Systems, pages 42–53, 2004.
[11] E. Crisostomi, S. Kirkland, and R. Shorten. A Google-like model of road network dynamics and its application to regulation and control. International Journal of Control, 84(3):633–651, 2011.
[12] M. Gamon and A. C. König. Navigation patterns from and to social media. In Proc. of AAAI Conference on Weblogs and Social Media, 2009.
[13] J. E. Gonzales, C. C. Chavis, Y. Li, and C. F. Daganzo. Multimodal transport in Nairobi, Kenya: Insights and recommendations with a macroscopic evidence-based model. In Proc. of Transportation Research Board 90th Annual Meeting, 2011.
[14] T. Idé and M. Sugiyama. Trajectory regression on road networks. In Proc. of AAAI Conference on Artificial Intelligence, pages 203–208, 2011.
[15] T. Katsuki, T. Morimura, and T. Idé. Bayesian unsupervised vehicle counting. Technical Report RT0951, IBM Research, 2013.
[16] D. Levin, Y. Peres, and E. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008.
[17] D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[18] T. Morimura and S. Kato. Statistical origin-destination generation with multiple sources. In Proc. of International Conference on Pattern Recognition, pages 283–290, 2012.
[19] T. Morimura, E. Uchibe, J. Yoshimoto, and K. Doya. A generalized natural actor-critic algorithm. In Proc. of Advances in Neural Information Processing Systems, volume 22, 2009.
[20] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proc. of International Conference on Machine Learning, 2000.
[21] J. R. Norris. Markov Chains. Cambridge University Press, 1998.
[22] OpenStreetMap. http://wiki.openstreetmap.org/.
[23] C. C. Pegels and A. E. Jelmert. An evaluation of blood-inventory policies: A Markov chain application. Operations Research, 18(6):1087–1098, 1970.
[24] J. A. Quinn and R. Nakibuule. Traffic flow monitoring in crowded cities. In Proc. of AAAI Spring Symposium on Artificial Intelligence for Development, 2010.
[25] C. M. Roberts. Radio frequency identification (RFID). Computers & Security, 25(1):18–26, 2006.
[26] S. M. Ross. Stochastic Processes. John Wiley & Sons, 1996.
[27] S. Santini. Analysis of traffic flow in urban areas using web cameras. In Proc. of IEEE Workshop on Applications of Computer Vision, pages 140–145, 2000.
[28] R. R. Sarukkai. Link prediction and path analysis using Markov chains. Computer Networks, 33(1-6):377–386, 2000.
[29] G. Tauchen. Finite state Markov-chain approximations to univariate and vector autoregressions. Economics Letters, 20(2):177–181, 1986.
[30] Y. Zhang, M. Roughan, C. Lund, and D. Donoho. An information-theoretic approach to traffic matrix estimation. In Proc. of Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 301–312. ACM, 2003.
[31] J. Zhu, J. Hong, and J. G. Hughes. Using Markov chains for link prediction in adaptive Web sites. In Proc. of Soft-Ware 2002: Computing in an Imperfect World, volume 2311, pages 60–73. Springer, 2002.
[32] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Proc. of AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.