{"title": "Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 409, "page_last": 416, "abstract": "", "full_text": "Fast Algorithms for Large-State-Space HMMs with\n\nApplications to Web Usage Analysis\n\nPedro F. Felzenszwalb1, Daniel P. Huttenlocher2, Jon M. Kleinberg2\n\n1AI Lab, MIT, Cambridge MA 02139\n\n2Computer Science Dept., Cornell University, Ithaca NY 14853\n\nAbstract\n\nIn applying Hidden Markov Models to the analysis of massive data\nstreams, it is often necessary to use an arti(cid:12)cially reduced set of\nstates; this is due in large part to the fact that the basic HMM\nestimation algorithms have a quadratic dependence on the size of\nthe state set. We present algorithms that reduce this computational\nbottleneck to linear or near-linear time, when the states can be\nembedded in an underlying grid of parameters. This type of state\nrepresentation arises in many domains; in particular, we show an\napplication to tra(cid:14)c analysis at a high-volume Web site.\n\n1\n\nIntroduction\n\nHidden Markov Models (HMMs) are used in a wide variety of applications where\na sequence of observable events is correlated with or caused by a sequence of un-\nobservable underlying states (e.g., [8]). Despite their broad applicability, HMMs\nare in practice limited to problems where the number of hidden states is relatively\nsmall. The most natural such problems are those where some abstract categoriza-\ntion provides a small set of discrete states, such as phonemes in the case of speech\nrecognition or coding and structure in the case of genomics. 
Recently, however, issues arising in massive data streams, such as the analysis of usage logs at high-traffic Web sites, have led to problems that call naturally for HMMs with large state sets over very long input sequences.

A major obstacle in scaling HMMs up to larger state spaces is the computational cost of implementing the basic primitives associated with them: given an n-state HMM and a sequence of T observations, determining the probability of the observations, or the state sequence of maximum probability, takes O(Tn^2) time using the forward-backward and Viterbi algorithms. The quadratic dependence on the number of states is a long-standing bottleneck that necessitates a small (often artificially coarsened) state set, particularly when the length T of the input is large.

In this paper, we present algorithms that overcome this obstacle for a broad class of HMMs. We improve the running times of the basic estimation and inference primitives to have a linear or near-linear dependence on the number of states, for a family of models in which the states are embedded as discrete grid points in an underlying parameter space, and the state transition costs (the negative logs of the state transition probabilities) correspond to a possibly non-metric distance on this space. This kind of embedded-state model arises in many domains, including object tracking, de-noising one-dimensional signals, and event detection in time series. Thus the algorithms can be seen as extending the applicability of HMMs to problems that are traditionally solved with more restricted linear Gaussian state-space models such as Kalman filtering.
Non-Gaussian state-space techniques are a research focus in their own right (e.g., [6]), and our methods could be used to improve their efficiency.

Given a structured embedding of states in an underlying d-dimensional space, our approach is to reduce the amount of work in the dynamic programming iterations of the Viterbi and forward-backward algorithms. For the Viterbi algorithm, we make use of distance transform (also known as Voronoi surface) techniques, which are widely used in computer vision, image processing, and discrete computational geometry [2]. For a broad class of distance functions on the embedding space (including functions that are far from obeying the triangle inequality), we are able to run each dynamic programming step of the Viterbi algorithm in O(n) time, yielding an overall running time of O(Tn). In the case of the forward-backward algorithm, we are able to achieve O(Tn) time for any transition probabilities that can be decomposed into a constant number of box filters [10]. Box filters are discrete convolution kernels that can be computed in linear time; many functions, including the Gaussian, can be expressed or approximated as the composition of a few box filters. Moreover, in the case of the forward-backward algorithm, we are able to obtain a running time of O(Tn log n) for arbitrary state transition probabilities, as long as they are based only on differences in the embedded positions of the states.

A motivating application for our work comes from the analysis of Web usage data [1]. We focus on the Internet Archive site (www.archive.org) as a prototypical example of a high-traffic site (millions of page-visits per month) offering an array of digital items for download. An important question at such a site is to determine variations in user interest in the items being offered.
We use a coin-tossing HMM model in which the discrete states correspond to the current probability of a user downloading a given item; this state set has a natural embedding in the interval [0, 1]. We study the effect of increasing the number of states, and find that a fairly large state set (of size roughly a hundred or more) is needed in order to detect brief but significant events that affect the download rate. With tens of millions of observations and a state set of this size, practical analysis would be computationally prohibitive without the faster HMM algorithms described here.

It should be noted that our methods can also be used in belief revision and belief propagation algorithms for Bayesian networks (e.g., [7]), as these algorithms are essentially variants of the Viterbi and forward-backward algorithms for HMMs. The methods are also applicable to continuous Markov models, which have recently been employed for Web user modeling based on duration of page views [9].

2 Hidden Markov Models

We briefly review HMMs; however, we assume that the reader is familiar both with HMMs and with the Viterbi and forward-backward estimation algorithms. Rabiner [8] provides a good introduction to HMMs; we use notation similar to his.
An HMM can be represented by a 5-tuple λ = (S, V, A, B, π), where S = {s_1, ..., s_n} is a finite set of (hidden) states, V = {v_1, ..., v_m} is a finite set of observable symbols, A is an n x n matrix with entries a_ij corresponding to the probability of going from state i to state j, B = {b_i(k)} where b_i(k) specifies the probability of observing symbol v_k in state s_i, and π is an n-vector with each entry π_i corresponding to the probability that the initial state of the system is s_i.

Function                                         | Viterbi              | Forward-Backward
a_ij = p if |i - j| <= d, a_ij = 0 otherwise     | Min-filter           | Box sum
a_ij ∝ exp(-|i - j|^2 / 2σ^2)                    | L2^2 dist. trans.    | Gaussian approx.
a_ij ∝ exp(-k|i - j|)                            | L1 dist. trans.      | FFT
a_ij = p if |i - j| <= d, a_ij = q otherwise     | Combin. min-filter   | Combin. box sum
a_ij ∝ exp(-|i - j|^2 / 2σ^2) if |i - j| <= d,   | Combin. dist. trans. | FFT
a_ij ∝ exp(-k|i - j|) otherwise                  |                      |

Table 1: Some transition probabilities that can be handled efficiently using our techniques (see text for an explanation). All running times are O(Tn) except those using the FFT, which are O(Tn log n).

Let q_t denote the state of the system at time t, while o_t denotes the observed symbol at time t. Given a sequence of observations O = (o_1, ..., o_T), there are three standard estimation (or inference) problems that have wide applications:

1. Find a state sequence Q = (q_1, ..., q_T) maximizing P(Q|O, λ).

2. Compute P(O|λ), the probability of an observation sequence being generated by λ.

3. Compute the posterior probabilities of each state, P(q_t = s_i|O, λ).

As is well known, these problems can be solved in O(Tn^2) time using the Viterbi algorithm for the first task and the forward-backward algorithm for the others.
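For concreteness, the standard O(Tn^2) Viterbi recursion that our algorithms accelerate can be sketched as follows. This is a minimal illustration on a hypothetical two-state model, not code from the paper:

```python
# Standard Viterbi algorithm: O(T n^2) for n states and T observations.
# pi[i]: initial probabilities; A[i][j]: transition probabilities;
# B[i][k]: probability of emitting symbol k in state i; obs: observed symbols.
def viterbi(pi, A, B, obs):
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []
    for o in obs[1:]:
        prev, step, delta = delta, [], []
        for j in range(n):
            # max over all n predecessors, for each of n states: the n^2 bottleneck
            i_best = max(range(n), key=lambda i: prev[i] * A[i][j])
            step.append(i_best)
            delta.append(prev[i_best] * A[i_best][j] * B[j][o])
        back.append(step)
    q = max(range(n), key=lambda i: delta[i])
    path = [q]
    for step in reversed(back):   # backward pass recovers the best sequence
        q = step[q]
        path.append(q)
    return path[::-1]

# Hypothetical toy model: state 0 favors symbol 0, state 1 favors symbol 1.
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.9, 0.1], [0.1, 0.9]]
print(viterbi([0.5, 0.5], A, B, [0, 0, 1, 1]))  # -> [0, 0, 1, 1]
```

The inner max over i is the term the distance transform techniques below replace with an O(n) computation.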
We show how to solve them more efficiently for a wide range of transition probabilities based on differences between states that are embedded in an underlying grid. This grid can be multi-dimensional; however, in this paper we consider only the one-dimensional case. Table 1 lists some widely applicable transition probability distributions that can be handled by our methods. The algorithms for each distribution differ slightly and are explained in the subsequent sections. The distributions given in the bottom part of the table can be computed as combinations of the basic distributions in the top part. Other distributions can be obtained using these same combination techniques, as long as only a constant number of distributions are being combined.

An additional problem, which we do not explicitly consider here, is that of determining the best model λ given some set of observed sequences {O_1, ..., O_l}. However, the most widely used technique for solving this problem, expectation maximization (EM), requires repeatedly running the forward-backward algorithm. Thus our algorithms also indirectly make the model learning problem more efficient.

2.1 Viterbi Algorithm

The Viterbi algorithm is used to find a maximum posterior probability state sequence, that is, a sequence Q = (q_1, ..., q_T) maximizing P(Q|O, λ). The main computation is to determine the highest probability along a path, accounting for the observations and ending in a given state. While there are an exponential number of possible paths, the Viterbi algorithm uses a dynamic programming approach (see e.g., [8]), employing the recursive equation

δ_{t+1}(j) = b_j(o_{t+1}) max_i ( δ_t(i) a_ij ),

where δ_t(i), for i = 1, 2, ..., n, encodes the highest probability along a path which accounts for the first t observations and ends in state s_i. The maximization term takes O(n^2) time, resulting in an overall time of O(Tn^2) for a sequence of length T. Computing δ_t for each time step is only the first pass of the Viterbi algorithm. In a subsequent backward pass, a minimizing path is found. This takes only O(Tn) time, so the forward computation is the dominant part of the running time.

In general a variant of the Viterbi algorithm is employed that uses negative log probabilities rather than probabilities, such that the computation becomes δ'_{t+1}(j) = b'_j(o_{t+1}) + min_i ( δ'_t(i) + a'_ij ), where ' is used to denote a negative log probability. We now turn to the computation of δ' for restricted forms of the transition costs a'_ij, where there is an underlying parameter space such that the costs can be expressed in terms of a distance between parameter values corresponding to the states. Let us denote such cost functions by ρ(i - j). Then,

δ'_{t+1}(j) = b'_j(o_{t+1}) + min_i ( δ'_t(i) + ρ(i - j) ).   (1)

We now show how the minimization in the second term can be computed in O(n) time rather than O(n^2). The approach is based on a generalization of the distance transform, which is defined for sets of points on a grid. Consider a grid with N locations and a point set P on that grid.

Figure 1: An example of the L1 distance transform for a grid with n = 9 points containing the point set P = {1, 3, 7}. The distance transform value at each point is given by the height of the lower envelope, depicted as a dashed contour.
The distance transform of P specifies, for each grid location, the distance to the closest point in the set P,

D_P(j) = min_{i ∈ P} ρ(i - j).

Clearly the distance transform can be computed in O(N^2) time by considering all pairs of grid locations. However, it can also be computed in linear time for many distance functions using simple algorithms (e.g., [2, 5]). These algorithms have small constants and are fast in practice. The algorithms work for distance transforms of d-dimensional grids, not just for the one-dimensional case that we illustrate here.

In order to compute the distance transform efficiently it is commonly expressed as

D_P(j) = min_i ( ρ(i - j) + 1(i) ),

where 1(i) is an indicator function for the set P such that 1(i) = 0 when i ∈ P and 1(i) = ∞ otherwise. Intuitively one can think of a collection of upward-facing cones, one rooted at each grid location that is in the set P. The transform is then obtained by taking the lower envelope (or minimum) of these cones. For concreteness consider the one-dimensional case with the L1 distance between grid locations. In this case the "cones" are v-shapes of slope 1 rising from the value y = 0 at each grid location that corresponds to a point of the set P, as illustrated in Figure 1.

It is straightforward to verify that a simple two-pass algorithm correctly computes this one-dimensional distance transform. First the vector D(j) is initialized to 1(j). Then in the forward pass, each successive element of D(j) is set to the minimum of its own value and one plus the value of the previous element (this is done "in place" so that updates affect one another),

j = 1, ..., n - 1 :  D(j) = min(D(j), D(j - 1) + 1).

The backward pass is analogous,

j = n - 2, ..., 0 :  D(j) = min(D(j), D(j + 1) + 1).

Consider the example in Figure 1.
After the initialization step the value of D is (∞, 0, ∞, 0, ∞, ∞, ∞, 0, ∞); after the forward pass it is (∞, 0, 1, 0, 1, 2, 3, 0, 1); and after the backward pass the final answer is (1, 0, 1, 0, 1, 2, 1, 0, 1).

This computation of the distance transform does not depend on the form of the function 1(i). This suggests a generalization of distance transforms where the indicator function 1(i) is replaced with an arbitrary function,

D_f(j) = min_i ( ρ(i - j) + f(i) ).

The same observation was used in [4] to efficiently compute certain tree-based cost functions for visual recognition of multi-part objects. Intuitively, the upward-facing cones are now rooted at height f(i) rather than at zero, and are positioned at every grid location. The function D_f is, as above, the lower envelope of these cones.

This generalized distance transform D_f is precisely the form of the minimization term in the computation of the Viterbi recursion δ' in equation (1), where each state corresponds to a grid point. The algorithm above can be used to compute each step of the Viterbi minimization in O(n) time when ρ is the L1 norm, giving an O(Tn) algorithm overall. This corresponds to the problem in the third row of Table 1. The computation for the second row of the table is similar, except that computing the distance transform for the L2 distance squared is a bit more involved (see [5]). The distribution in the first row of the table can be handled using a linear-time algorithm for the min-filter [3].

Combinations of transforms can be formed by computing each function separately and then taking the minimum of the results. The entries in the bottom part of Table 1 show two such combinations. The function in the fourth row is often of practical interest, where the probability is p of staying near the current state and q of transitioning to any other state.
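The two-pass algorithm, in the generalized form D_f(j) = min_i ( |i - j| + f(i) ), can be sketched in a few lines; this is a minimal sketch, not the authors' implementation, with INF standing in for ∞:

```python
INF = float("inf")

# Generalized one-dimensional L1 distance transform:
# D(j) = min_i ( |i - j| + f(i) ), computed in O(n) with two in-place passes.
def dt1d(f):
    D = list(f)
    for j in range(1, len(D)):            # forward pass
        D[j] = min(D[j], D[j - 1] + 1)
    for j in range(len(D) - 2, -1, -1):   # backward pass
        D[j] = min(D[j], D[j + 1] + 1)
    return D

# Figure 1 example: n = 9 grid, P = {1, 3, 7}, f = indicator of P.
f = [INF, 0, INF, 0, INF, INF, INF, 0, INF]
print(dt1d(f))  # -> [1, 0, 1, 0, 1, 2, 1, 0, 1]
```

Setting f(i) = δ'_t(i) makes each call one minimization step of equation (1).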
The function in the last row is a so-called "truncated quadratic", arising commonly in robust statistics. In the experimental section we use a similar function that is the combination of two linear components with different slopes.

2.2 Forward-Backward Algorithm

The forward-backward algorithm is used to find the probability of the observed sequence given the model, P(O|λ). The computation also determines the posterior probability of the states at each time, P(q_t = s_i|O, λ). Most of the work in the forward-backward algorithm is spent in determining the so-called forward and backward probabilities at each step (again see [8] or any other introduction to HMMs). The forward probabilities at a given time can be expressed as the n-vector

α_t(i) = P(o_1, o_2, ..., o_t, q_t = s_i | λ),

i.e., the probability of the partial observation sequence up until time t and the state at time t, given the model. The backward probabilities β_t can be expressed analogously and are not considered here. The standard computation is to express the vector α_t recursively as

α_{t+1}(j) = b_j(o_{t+1}) Σ_{i=1}^{n} ( α_t(i) a_ij ).

In this form it is readily apparent that computing α_{t+1} from α_t involves O(n^2) operations, as each of the n entries in the vector involves a sum of n terms.

When the transition probabilities are based just on the differences between the underlying coordinates corresponding to the states, a_ij = h(j - i), the recursive computation of α becomes

α_{t+1}(j) = b_j(o_{t+1}) Σ_{i=1}^{n} ( α_t(i) h(j - i) ).

The summation term is simply the convolution of α_t with h. In general, this discrete convolution can be computed in O(n log n) time using the FFT.
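As a sketch of the linear-time special case discussed below (a uniform "box" kernel), one forward update can be computed with prefix sums instead of an explicit convolution; the function and parameter names here are our own illustration, not from the paper:

```python
# One forward update alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i alpha_t(i) h(j - i)
# for a box kernel h(d) = k when |d| <= w and 0 otherwise, in O(n) time.
def forward_step_box(alpha, b_next, w, k):
    n = len(alpha)
    pre = [0.0]                      # prefix sums of alpha_t
    for a in alpha:
        pre.append(pre[-1] + a)
    out = []
    for j in range(n):
        lo, hi = max(0, j - w), min(n - 1, j + w)
        out.append(b_next[j] * k * (pre[hi + 1] - pre[lo]))
    return out

# Each output entry is 0.5 times the sum of alpha_t over a window of radius 1.
print(forward_step_box([0.1, 0.4, 0.3, 0.2], [1.0, 1.0, 1.0, 1.0], 1, 0.5))
```

Cascading a few such box updates approximates a Gaussian kernel, as noted next.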
While the use of the FFT here is a simple observation, it enables efficient calculation of the forward and backward probabilities for problems where the states are embedded in a grid.

In certain specific cases convolution can be computed in linear time. One case of particular interest is the so-called box sum, in which the convolution kernel is a constant function within a region. That is, h(j) = k over some interval and h(j) = 0 outside that interval. A Gaussian can be well approximated by convolution of just a few such box filters [10], and thus it is possible to approximately compute the functions in the first and second rows of Table 1 in O(Tn) time. Similarly to the Viterbi case, functions can be created from combinations of box sums. In this case a weighted sum of the individual functions is used rather than their minimum.

3 Coin-Tossing Models and Web Usage Analysis

We now turn to the application mentioned in the introduction: using a coin-tossing model with a one-dimensional embedding of states to estimate the download probability of items at a Web site. Our data comes from the Internet Archive site (www.archive.org), which offers digital text, movie, and audio files. Each item on the site has a separate description page, which contains the option to download it; this is similar to the paper description pages on CiteSeer or the e-print arXiv and to the item description pages at online retailers (with the option to purchase). On a site of this type, the probability that a user chooses to acquire an item, conditioned on having visited the description page, can be viewed as a measure of interest [1].

This ratio of acquisitions to visits is particularly useful as a way of tracking the changes in user interest in an item.
Suppose the item is featured prominently on the site; or an active off-site link to the item description drives a new sub-population of users to it; or a technical problem makes it impossible to obtain the item. These are all discrete events that can have a sudden, significant effect on the fraction of users who download the item. By identifying such discrete changes, we can discover the most significant events, both on the site and on the Web at large, that have affected user interest in each item. Such a history of events can be useful to site administrators, as feedback to the users of the site, and for researchers.

This type of change-detection fits naturally into the framework of HMMs. For a fixed item, each observation corresponds to a user's visit to the item description, and there are two observable symbols V = {1, 0}, corresponding to the decision to download or not. We assume a model in which there is a hidden coin of some unknown bias that is flipped when the user visits the description and whose outcome determines the download decision. Thus, each state s_i corresponds to a discretized value p_i of the underlying bias parameter.

Figure 2: Estimate of underlying download bias; best state sequence for models with step sizes of .1 (9 states) on the left and .01 (81 states) on the right.
The natural observation cost function b'_i(k) is simply the negative log of the probability: p_i for a head and (1 - p_i) for a tail. The points at which state transitions occur in the optimal state sequence thus become candidates for discrete changes in user interest. The form of the state transition costs is based on our assumptions about the nature of these changes. As indicated above, they often result from the introduction of a new sub-population with different interests or expectations; thus, it is natural to expect that the transition cost should rise monotonically as the change in bias increases, but that even large changes should happen with some regularity.

We quantize the underlying bias parameter values equally, such that |p_i - p_j| ∝ |i - j|, and use a cost function of the form

a'_ij = min( k_1 |i - j|, k_2 |i - j| + k_3 ),

where the k_i are positive constants and k_1 > k_2. This two-slope linear model is monotone increasing, but once the change in bias becomes large enough the rate of increase is small. The model prefers constant or small changes in bias but allows for arbitrarily large changes, similarly to the "truncated model" common in robust statistics.

Figure 2 shows the best state sequence obtained with the Viterbi algorithm under this model, using two different discretizations of the parameter space, for an input sequence of 11,159 visits from August 2002 to April 2003 to the description page for a particular video in the Internet Archive. On the left is a 9-state model with probabilities ranging from .1 to .9 in steps of size .1. On the right is an 81-state model with the same range of .1 to .9 but where the steps are of size .01.
The x-axis shows the visit time (UTC, in billions of seconds since the epoch) and the y-axis shows the bias associated with the state in the optimal sequence at that time.

We begin by observing that both models capture a number of discrete changes in download behavior. These changes correspond to genuine external events. In particular, both models capture the long-term drop and rebound in bias which corresponds to the time period where the item was highlighted on a top-level page, as well as the two rightmost short downward spikes which correspond to technical problems that made downloads temporarily impossible. Even though these latter failures were relatively short-lived, lasting a few hours out of the several-month range, they are detected easily by the stochastic model; in contrast, temporal windowing techniques miss such short events.

The two plots, however, exhibit some subtle but important differences that illustrate the qualitatively greater power we obtain from a larger state set. In particular, the 81-state model has four short downward spikes rather than three in the time interval from 1.045 to 1.05. The latter two are the technical failures identified by both models, but the first two correspond to two distinct off-site referring pages, each of which drove a significant amount of low-interest user traffic to the item. While the 81-state model was able to resolve these as separate events, the 9-state model blurs them into an artificial period of medium bias, followed by a downward spike to the lowest possible state (i.e.,
the same state it used for the technical failures). Finally, the 81-state model discovers a gradual decline in the download rate near the end of the plot that is not visible when there are fewer states.

We see that a model with a larger state set is able to pick up the effects of different types of events, both on-site and off-site highlighting of the item as well as technical problems, and that these events often result in sudden, discrete changes. Moreover, it appears that beyond a certain point, the set of significant events remains roughly fixed even as the resolution in the state set increases. While we do not show the result here, an 801-state model with step size .001 produces a plot that is qualitatively indistinguishable from the 81-state model with step size .01; only the y-values provide more detail with the smaller step size.

References

[1] J. Aizen, D. Huttenlocher, J. Kleinberg, A. Novak, "Traffic-Based Feedback on the Web," to appear in Proceedings of the National Academy of Sciences.

[2] G. Borgefors, "Distance Transformations in Digital Images," Computer Vision, Graphics and Image Processing, Vol. 34, pp. 344-371, 1986.

[3] Y. Gil and M. Werman, "Computing 2D Min, Max and Median Filters," IEEE Trans. PAMI, Vol. 15, pp. 504-507, 1993.

[4] P. Felzenszwalb, D. Huttenlocher, "Efficient Matching of Pictorial Structures," Proc. IEEE Computer Vision and Pattern Recognition Conf., 2000, pp. 66-73.

[5] A. Karzanov, "Quick algorithm for determining the distances from the points of the given subset of an integer lattice to the points of its complement," Cybernetics and System Analysis, 1992. (Translation from the Russian by Julia Komissarchik.)

[6] G. Kitagawa, "Non-Gaussian State Space Modeling of Nonstationary Time Series," Journal of the American Statistical Association, 82, pp. 1032-1063, 1987.

[7] J.
Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.

[8] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, Vol. 77(2), pp. 257-286, 1989.

[9] S.L. Scott, P. Smyth, "The Markov Modulated Poisson Process and Markov Poisson Cascade with Applications to Web Traffic Data," Bayesian Statistics 7 (2003), to appear.

[10] W.M. Wells, "Efficient synthesis of Gaussian filters by cascaded uniform filters," IEEE Trans. PAMI, Vol. 8(2), pp. 234-239, 1986.
", "award": [], "sourceid": 2525, "authors": [{"given_name": "Pedro", "family_name": "Felzenszwalb", "institution": null}, {"given_name": "Daniel", "family_name": "Huttenlocher", "institution": null}, {"given_name": "Jon", "family_name": "Kleinberg", "institution": null}]}