{"title": "Online Markov Decoding: Lower Bounds and Near-Optimal Approximation Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 5680, "page_last": 5690, "abstract": "We resolve the fundamental problem of online decoding with general nth order ergodic Markov chain models. Specifically, we provide deterministic and randomized algorithms whose performance is close to that of the optimal offline algorithm even when latency is small. Our algorithms admit efficient implementation via dynamic programs, and readily extend to (adversarial) non-stationary or time-varying settings. We also establish lower bounds for online methods under latency constraints in both deterministic and randomized settings, and show that no online algorithm can perform significantly better than our algorithms. To our knowledge, our work is the first to analyze general Markov chain decoding under hard constraints on latency. We provide strong empirical evidence to illustrate the potential impact of our work in applications such as gene sequencing.", "full_text": "Online Markov Decoding: Lower Bounds and\n\nNear-Optimal Approximation Algorithms\n\nVikas K. Garg\n\nMIT\n\nvgarg@csail.mit.edu\n\nTamar Pichkhadze\n\nMIT\n\ntamarp@alum.mit.edu\n\nAbstract\n\nWe resolve the fundamental problem of online decoding with general nth order er-\ngodic Markov chain models. Speci\ufb01cally, we provide deterministic and randomized\nalgorithms whose performance is close to that of the optimal of\ufb02ine algorithm even\nwhen latency is small. Our algorithms admit ef\ufb01cient implementation via dynamic\nprograms, and readily extend to (adversarial) non-stationary or time-varying set-\ntings. We also establish lower bounds for online methods under latency constraints\nin both deterministic and randomized settings, and show that no online algorithm\ncan perform signi\ufb01cantly better than our algorithms. 
To our knowledge, our work is the first to analyze general Markov chain decoding under hard constraints on latency. We provide strong empirical evidence to illustrate the potential impact of our work in applications such as gene sequencing.

1 Introduction

Markov models, in their various incarnations, have long formed the backbone of diverse applications such as telecommunication [1], biological sequence analysis [2], protein structure prediction [3], language modeling [4], automatic speech recognition [5], financial modeling [6], gesture recognition [7], and traffic analysis [8, 9]. In a Markov chain model of order n, the conditional distribution of the next state at any time i depends only on the current state and the previous n − 1 states, i.e.,

P(yi | y1, . . . , yi−1) = P(yi | yi−n, . . . , yi−1)  ∀i .

Often, the states are not directly accessible but need to be inferred or decoded from the observations, i.e., a sequence of tokens emitted by the states. For instance, in tagging applications [10], each state pertains to a part-of-speech tag (e.g. noun, adjective) and each word wi in an input sentence w = (w1, . . . , wT) needs to be labeled with a probable tag yi that might have emitted the word. Thus, it is natural to endow each state with a distribution over the tokens it may emit. For example, nth order hidden Markov models (n-HMM) [11] and (n + 1)-gram language models [4] assume the joint distribution P(y, w) of states y = (y1, . . . , yT) and observations w factorizes as

P(y, w) = Π_{i=1}^{T} P(yi | yi−n, . . . , yi−1) P(wi | yi) ,

where y−n+1, . . . , y0 are dummy states, and the transition distributions P(yi | yi−n, . . . , yi−1) and the emission distributions P(wi | yi) are estimated from data. We call a Markov model ergodic if there is a diameter Δ such that any state can be reached from any other state in at most Δ transitions.
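The diameter Δ is a property of the transition graph alone. As a quick illustration (not from the paper; the dict-of-sets adjacency encoding and function name are our own), it can be computed by breadth-first search from every state:

```python
from collections import deque

def chain_diameter(adj):
    """BFS from every state; Delta is the max shortest-path length over
    ordered state pairs. Returns None if some state cannot reach another
    (i.e., the chain is not ergodic in the paper's sense)."""
    diam = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        if len(dist) < len(adj):
            return None  # some state is unreachable from src
        diam = max(diam, max(dist.values()))
    return diam

# Fully connected 3-state chain: Delta = 1.
full = {0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1, 2}}
# Directed 3-cycle: reaching a state's predecessor takes 2 steps, so Delta = 2.
cycle = {0: {1}, 1: {2}, 2: {0}}
```
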
For instance, a fully connected Markov chain corresponds to Δ = 1. Note that having Δ > 1 is often natural, e.g., two successive punctuation marks (such as semicolons) are unlikely in an English document. When the transition distributions do not change with time i, the model is called time-homogeneous; otherwise it is non-stationary, time-varying, or non-homogeneous [12, 13, 14]. Given a sequence w of T observations, the decoding problem is to infer a most probable sequence or path y∗ of T states

y∗ ∈ argmax_y P(y, w) = argmax_y log P(y, w) .

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Model | Reward Ri(yi | y[i−n,i−1])
(n + 1)-GRAM | log P(yi | yi−n, . . . , yi−1) + log P(wi | yi)
1-HMM | log P(yi | yi−1) + log P(wi | yi)
n-MEMM | log [ exp(θ⊤φ(y[i−n,i−1], yi, w, i)) / Σ_{y′i} exp(θ⊤φ(y[i−n,i−1], y′i, w, i)) ]
n-CRF | θ⊤φ(y[i−n,i−1], yi, w, i)

Table 1: Standard Markov models in the reward form. We use y[i,j] to denote (yi, yi+1, . . . , yj).

Decoding is a key inference problem in other structured prediction settings [15, 16] as well; e.g., maximum entropy Markov models (MEMM) [17] and conditional random fields (CRF) [18, 19] employ learnable parameters θ and define the conditional dependence of each state on the observations through feature functions φ. The decoding task in all these models can be expressed in the form

y∗ ∈ argmax_y Σ_{i=1}^{T} Ri(yi | yi−n, . . . , yi−1) ,   (1)

where we have made the dependence on observations w implicit in the reward functions Ri, as shown in Table 1.
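For concreteness, here is a minimal sketch (our own illustration, not the paper's implementation) of solving the offline maximization (1) exactly in the first-order case (n = 1); the `reward(i, prev, cur)` callback signature is an assumption of this sketch:

```python
def viterbi(states, T, reward):
    """Exactly maximize sum_i R_i(y_i | y_{i-1}) for a first-order model.
    reward(i, prev, cur) returns R_i(cur | prev); prev is None at i = 0."""
    best = {s: reward(0, None, s) for s in states}  # best total reward ending in s
    back = []                                       # backpointers, one dict per step
    for i in range(1, T):
        ptr, nxt = {}, {}
        for s in states:
            p = max(best, key=lambda u: best[u] + reward(i, u, s))
            ptr[s], nxt[s] = p, best[p] + reward(i, p, s)
        back.append(ptr)
        best = nxt
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):  # trace backpointers to recover the path
        path.append(ptr[path[-1]])
    return best[last], path[::-1]

# Toy 2-state example: emission-style rewards with a -1 switching penalty.
R_em = [[1, 0], [0, 2], [0, 2]]
def toy_reward(i, prev, cur):
    return R_em[i][cur] + (0 if prev is None or prev == cur else -1)
```

Note that, as the surrounding text emphasizes, this procedure must see all T observations before emitting any label, i.e., its latency is T − 1.
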
The Viterbi algorithm [1] is employed to solve problems of the form (1) exactly. However, the algorithm cannot decode any observation until it has processed the entire observation sequence, i.e., computed and stored, for each state s, a most probable sequence of T states that ends in s. We say an algorithm has a hard latency L if L is the smallest B such that the algorithm needs to access at most B + 1 observations wi, wi+1, . . . , wi+B to generate the label for observation wi at any time i during the decoding process. Thus, the latency of the Viterbi algorithm on a sequence of length T is T − 1, which is prohibitive for large T, especially in memory-impoverished systems such as IoT devices [20, 21, 22, 23]. Besides, the algorithm is not suitable for critical scenarios such as patient monitoring, intrusion detection, and credit card fraud monitoring, where delay following the onset of a suspicious activity might be detrimental [24]. Moreover, low latency is desirable for tasks such as drug discovery that rely on detecting interleaved coding regions in massive gene sequences.
A lot of effort has been, and continues to be, invested into speeding up the Viterbi algorithm or reducing its memory footprint [25]. Some prominent recent approaches include fast matrix multiplication [26], compression and storage reduction for HMMs [27, 28, 29], and heuristics such as beam search and simulated annealing [30, 31]. Several of these methods are based on the observation that if all the candidate subsequences in the Viterbi algorithm converge at some point, then all subsequent states will share a common subsequence up to that point [32, 33]. However, these methods do not guarantee a reduction in latency since, in the worst case, they still need to process all the rewards before producing any output. [24] introduced the Online Step Algorithm (OSA), with provable guarantees, to handle soft latency requirements in first-order models.
However, OSA makes the strong assumption that uncertainty in any state label decreases with latency. This assumption does not hold for important applications such as genome data. Moreover, OSA does not provide direct control over latency (which must instead be tuned), and is limited to first-order fully connected settings. We draw inspiration from, and generalize, the work by [34] on online server allocation, under what we view as a first-order fully connected non-homogeneous setting (when the number of servers is one).

Our contributions

We investigate the problem of online decoding with Markov chain models under hard latency constraints, and design almost optimal online deterministic and randomized algorithms for problems of the form (1). Our bounds apply to general settings, e.g., when the rewards vary with time (non-homogeneous settings), or even when they are presented in an adversarial or adaptive manner. Our guarantees hold for finite latency (i.e. not only asymptotically), and improve as latency increases. Our algorithms are efficient dynamic programs that may be deployed not only in settings where the Viterbi algorithm is typically used but also, as mentioned earlier, in several others where it is impractical. Thus, our work would potentially widen the scope of, and expedite, scientific discovery in several fields that rely critically on efficient online Markov decoding.
We also provide the first results on the limits of online Markov decoding under latency constraints. Specifically, we craft lower bounds for the online approximation of the Viterbi algorithm in both deterministic and randomized ergodic chain settings. Moreover, we establish that no online algorithm can perform significantly better than our algorithms.
In particular, our algorithms provide strong guarantees even for low latency, and nearly match the lower bounds for sufficiently large latency.

Setting | Lower bound | Upper bound (our algorithms)
DETERMINISTIC (Δ = 1, n = 1) | 1 + 1/L + 1/(L² + 1) | min{ (1 + 1/L)(L + 1)^{1/L}, 1 + 4/(L − 7) }
RANDOMIZED (Δ = 1, n = 1, ε > 0) | 1 + (1 − ε)/(L + ε) | 1 + 1/L
DETERMINISTIC | 1 + (Δ̃/L)(1 + (Δ̃ + L − 1)/√((Δ̃ + L − 1)² + Δ̃)) | 1 + min{ Θ(log L/(L − Δ̃ + 1)), Θ(1/(L − 8Δ̃ + 1)) }
RANDOMIZED (ε > 0) | 1 + (2^{Δ−1}⌈1/ε⌉ − 1)n / (2^{Δ−1}⌈1/ε⌉L + n) | 1 + Θ(1/(L − Δ̃ + 1))

Table 2: Summary of our results in terms of the competitive ratio ρ. Note that the effective diameter Δ̃ = Δ + n − 1. To fit some results within the margins, we use the standard notation Θ(·) on the growth of functions and summarize the performance of Peek Search asymptotically in L. The non-asymptotic dependence on L is made precise in all cases in our theorem statements.

We introduce several novel ideas and analyses in the context of approximate Markov decoding. For example, we approximate a non-discounted objective over horizon T by a sequence of smaller discounted subproblems over horizon L + 1, and track the Viterbi algorithm by essentially foregoing rewards on at most Δ̃ = Δ + n − 1 steps in each smaller problem.
Our design of constructions toward proving lower bounds, in a setting predicated on the interplay of several heterogeneous variables, namely L, n, and Δ, is another significant technical contribution. We believe our tools will foster the design of new online algorithms, and the establishment of combinatorial bounds, for related settings such as dynamic Bayesian networks, hidden semi-Markov models, and model-based reinforcement learning.

2 Overview of our results

We introduce some notation. We define [a, b] ≜ (a, a + 1, . . . , b) and [N] ≜ (1, 2, . . . , N). Likewise, y[N] ≜ (y1, . . . , yN) and y[a,b] ≜ (ya, . . . , yb). We denote the last n states visited by the online algorithm at time i by ŷ[i−n,i−1], and those by the optimal offline algorithm by y∗[i−n,i−1]. Defining positive reward functions R̄i = Ri + p by adding a sufficiently large positive number p to each reward, we note from (1) that an optimal sequence of states for input observations w of length T is

y∗[1,T] ∈ arg max_{y1,...,yT} Σ_{i=1}^{T} R̄i(yi | y[i−n,i−1]) .   (2)

We use OPT to denote the total reward accumulated by the optimal offline algorithm, and ON to denote that received by the online algorithm. We evaluate the performance of any online algorithm in terms of its competitive ratio ρ, defined as the ratio OPT/ON. That is,

ρ = Σ_{i=1}^{T} R̄i(y∗i | y∗[i−n,i−1]) / Σ_{i=1}^{T} R̄i(ŷi | ŷ[i−n,i−1]) .

Clearly, ρ ≥ 1. Our goal is to design online algorithms that have competitive ratio close to 1. For randomized algorithms, we analyze the ratio obtained by taking the expectation of the total online reward over the algorithm's internal randomness. The performance of any online algorithm depends on the order n, latency L, and diameter Δ.
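The shift R̄i = Ri + p leaves the maximizer in (2) unchanged, since every length-T path gains exactly the same constant T·p. A brute-force sketch on a toy two-state instance (the reward numbers and encoding are our own illustration):

```python
from itertools import product

def best_path(R, states, T, p=0.0):
    """Brute-force argmax of sum_i (R_i(y_i | y_{i-1}) + p) over length-T paths.
    R[i][prev][cur] is the step-i reward; state 0 doubles as the dummy y_0."""
    def score(y):
        prev, tot = 0, 0.0
        for i, s in enumerate(y):
            tot += R[i][prev][s] + p
            prev = s
        return tot
    return max(product(states, repeat=T), key=score)

# Toy rewards: any constant shift p changes every path's total by T*p,
# so the argmax is invariant under the shift.
R = [[[1.0, 0.0], [0.0, 0.0]],
     [[0.5, 2.0], [0.0, 1.0]],
     [[0.0, 1.0], [3.0, 0.0]]]
```
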
Table 2 provides a summary of our results. Note that our algorithms are asymptotically optimal in L. For the finite L case, we first consider the fully connected first-order models. Our randomized algorithm matches the lower bound even¹ with L = 1, since we may set ε arbitrarily close to 0. Note that just with L = 1, our deterministic algorithm achieves a competitive ratio 4, and this ratio reduces further as L increases. Moreover, our ratio rapidly approaches the lower bound with increase in L. Finally, in the general setting, our algorithms are almost optimal when L is sufficiently large compared to Δ̃ = Δ + n − 1. We call Δ̃ the effective diameter since it nicely encapsulates the roles of order n and diameter Δ toward the quality of approximation.
The rest of the paper is organized as follows. We first introduce and analyze our deterministic Peek Search algorithm for homogeneous settings in section 3. We then introduce the Randomized Peek Search algorithm in section 4. In section 5, we propose the deterministic Peek Reset algorithm that performs better than deterministic Peek Search for large L. We then present the lower bounds in section 6, and demonstrate the merits of our approach via strong empirical evidence in section 7. We analyze the non-homogeneous setting, and provide all the proofs, in the supplementary material.

¹It is easy to construct examples where any algorithm with no latency may be made to incur an arbitrarily high ρ. Thus, in the fully connected first-order Markov setting, online learning is meaningful only for L ≥ 1.

3 Peek Search

Our idea is to approximate the sum of rewards over T steps in (2) by a sequence of smaller problems over L + 1 steps. The Peek Search algorithm is so named since at each time i, besides the observation wi, it peeks into the next L observations wi+1, . . . , wi+L.
The algorithm then leverages the subsequence w[i,i+L] to decide its next state ŷi. Let γ ∈ (0, 1) be a discount factor. Specifically, the algorithm repeats the following procedure at each time i. First, it finds a path of length L + 1 emanating from the current state ŷi−1 that fetches maximum discounted reward. The discounted reward on any path is computed by scaling down the ℓth edge, ℓ ∈ {0, . . . , L}, on the path by γ^ℓ. Then, the algorithm moves to the first state of this path and repeats the procedure at time i + 1. Note that at time i + 1, the algorithm need not continue with the second edge on the optimal discounted path computed at the previous time step i, and is free to choose an alternative path. Formally, at time i, the algorithm computes ỹi ≜ (ỹ⁰i, ỹ¹i, . . . , ỹᴸi) that maximizes the following objective over valid paths y = (yi, . . . , yi+L),

R(yi | ŷ[i−n,i−1]) + Σ_{j=1}^{n−1} γ^j R(yi+j | ŷ[i−n+j,i−1], y[i,i+j−1]) + Σ_{j=n}^{L} γ^j R(yi+j | y[i+j−n,i+j−1]) ,

sets the next state ŷi = ỹ⁰i, and receives the reward R(ŷi | ŷ[i−n,i−1]). Note that we have dropped the subscript i from Ri since, in homogeneous settings, the reward functions do not change with time i. For any given L and Δ̃, we optimize to get the optimal γ. Intuitively, γ may be viewed as an explore-exploit parameter that indicates the confidence of the online algorithm in the best discounted path: γ grows as L increases, and thus a high value of γ indicates that the path computed at time i may be worth tracing at the subsequent few steps as well. In contrast, the algorithm is uncertain for small values of L. We have the following near-optimal result on the performance of Peek Search.
Theorem 1.
The competitive ratio of Peek Search on Markov chain models of order n with diameter Δ for L ≥ Δ + n − 1 is ρ ≤ (γ^{Δ+n−1} − γ^{L+1})^{−1}. Setting γ = ((Δ + n − 1)/(L + 1))^{1/(L−Δ−n+2)}, we get

ρ ≤ [(L + 1)/(L − Δ − n + 2)] · [(L + 1)/(Δ + n − 1)]^{(n+Δ−1)/(L−Δ−n+2)} = 1 + Θ(log L / (L − Δ̃ + 1)) .

Proof. (Sketch) We first consider the fully connected first-order setting (i.e. n = 1, Δ = 1). Our analysis hinges on two important facts. Since Peek Search chooses a path that maximizes the total discounted reward over the next L + 1 steps, it is guaranteed to fetch all of the discounted reward pertaining to the optimal path except that available on the first step of the optimal path (see Fig. 1 for visual intuition). Alternatively, Peek Search could have persisted with the maximizing path computed at the previous time step (recall that only the first step of this path was taken to reach the current state). We exploit the fact that this path is now worth 1/γ times its anticipated value at the previous step.
Now consider n > 1. The online algorithm may jump to any state on the optimal offline, i.e. Viterbi, path in one step. However, the reward now depends on the previous n states, and so the online algorithm may have to wait an additional n − 1 steps before it can trace the subsequent optimal path. Finally, as explained in Fig.
1, when Δ > 1, the online algorithm may have to forfeit rewards on (at most) Δ steps, in addition to the n − 1 steps, in order to join the optimal path.

[Figure 1 omitted: two trellis diagrams over states (rows) and time (columns), with edge discounts 1, γ, γ², γ³ marked, for a fully connected graph (left) and a graph that is not fully connected (right).]

Figure 1: Visual intuition for the setting n = 1. (Left) A trellis diagram obtained by unrolling a fully connected Markov graph (i.e. diameter Δ = 1). The states are shown along the rows, and time along the columns. The system is currently in state 4 (shown in red), and has access to rewards and observations (shown inside circles) for the next L + 1 steps. The unknown optimal path is shown in blue, and the weights with which rewards are scaled are shown on the edges. One option available to the online algorithm is to jump to state 1 (possibly fetching zero reward) and then follow the optimal path for the subsequent L steps. Note that the online algorithm might choose a different path, but it is guaranteed at least as much reward since it maximizes the discounted reward over L + 1 steps. γ approaches 1 with increase in L. This ensures that the online algorithm makes nearly the most of L steps out of every L + 1 steps. (Right) If the graph is not fully connected, some of the transitions may not be available (e.g. state 4 to state 1 in our case). Therefore, the online algorithm might not be able to join the optimal path in one step, and thus may have to forgo additional rewards.

We show in the supplementary material that this guarantee on the performance of Peek Search extends to the non-homogeneous settings, including those where the rewards may be adversarially chosen.
Note that naïvely computing a best path by enumerating all paths of length L + 1 would be computationally prohibitive, since the number of such paths is exponential in L. Fortunately, we can design an efficient dynamic program for Peek Search. Specifically, we can show that for every ℓ ∈ {1, 2, . . . , L}, the reward on the optimal discounted path of length ℓ can be recursively computed from an optimal path of length ℓ − 1 using O(|K|^n) computations. We have the following result.
Theorem 2. Peek Search can compute a best γ-discounted path for the next L + 1 steps, in nth order Markov chain models, in time O(L|K|^n), where K is the set of states.
We outline an efficient procedure, underlying Theorem 2, in the supplementary material. We now introduce two algorithms that do not recompute the paths at each time step. These algorithms provide even tighter (expected) approximation guarantees than Peek Search for larger values of the latency L.

4 Randomized Peek Search

We first introduce the Randomized Peek Search algorithm, which removes the asymptotic log factor from the competitive ratio in Theorem 1. Unlike Peek Search, this method does not discount the rewards on paths. Specifically, the algorithm first selects a reset point ℓ uniformly at random from {1, 2, . . . , L + 1}. This number is private information for the online algorithm. The randomized algorithm recomputes the optimal non-discounted path (which corresponds to γ = 1) of length L + 1, once every L + 1 steps, at each time i·(L + 1) + ℓ, and follows this path for the next L + 1 steps without any updates. We have the following result that underscores the benefits of randomization.
Theorem 3.
Randomized Peek Search achieves, in expectation, on Markov chain models of order n with diameter Δ, a competitive ratio

ρ ≤ 1 + (Δ + n − 1)/(L + 1 − (Δ + n − 1)) = 1 + Θ(1/(L − Δ̃ + 1)) .

Proof. (Sketch) Since it maximizes the non-discounted reward, for each random reset point ℓ, the online algorithm receives at least as much reward as the optimal offline algorithm minus the reward on at most Δ̃ steps every L + 1 steps. We show that, in expectation, Randomized Peek Search misses only (at most) a Δ̃/(L + 1) fraction of the optimal offline reward.

Theorem 3 is essentially tight, since it nearly matches the lower bound described previously in section 2. We leverage insights from Randomized Peek Search to translate its almost optimal expected performance to the deterministic setting. Specifically, we introduce the Peek Reset algorithm, which may be loosely viewed as a derandomization of Randomized Peek Search. The main trick is to conjure a sequence of reset points, each over a variable number of steps. This allows the algorithm to make adaptive decisions about when to forgo rewards. Both Randomized Peek Search and Peek Reset can compute rewards on their paths efficiently by using the procedure for Peek Search as a subroutine.

5 Peek Reset

We now present the deterministic Peek Reset algorithm, which performs better than Peek Search when the latency L is sufficiently large. Like Randomized Peek Search, Peek Reset recomputes a best non-discounted path and takes multiple steps on this path. However, the number of steps taken is not fixed to L + 1 but may vary in each phase. Specifically, let (i) denote the time at which phase i begins. The algorithm follows, in phase i, a sequence of states ŷ(i) ≜ (ŷ(i), ŷ(i)+1, . . .
, ŷTi−1) that maximizes the following objective over valid paths y = (y(i), . . . , yTi−1):

f(y) ≜ R(y(i) | ŷ[(i)−n,(i)−1]) + Σ_{j=1}^{n−1} R(y(i)+j | ŷ[(i)−n+j,(i)−1], y[(i),(i)+j−1]) + Σ_{j=n}^{Ti−(i)−1} R(y(i)+j | y[(i)+j−n,(i)+j−1]) ,

where Ti is chosen from the following set (breaking ties arbitrarily):

arg min_{t ∈ [(i)+L/2+1, (i)+L]} max_{(yt−n,...,yt)} R(yt | y[t−n,t−1]) .

Then, the next phase i + 1 begins at time Ti. We have the following result.
Theorem 4. The competitive ratio of Peek Reset on Markov chain models of order n with diameter Δ for latency L is

ρ ≤ 1 + 2(Δ + n)(Δ + n − 1)/(L − 8(Δ + n − 1) + 1) = 1 + Θ(1/(L − 8Δ̃ + 1)) .

Proof. (Sketch) The algorithm gives up reward on at most Δ̃ steps every L + 1 steps; however, these steps are cleverly selected. Note that Ti is chosen from the interval [(i) + L/2 + 1, (i) + L], which contains steps from both phase i and phase i + 1. Thus, the algorithm gets to peek into phase i + 1 before deciding on the number of steps to be taken in phase i.

A comparison of Theorem 4 with Theorem 1 reveals that Peek Reset provides better upper bounds on the approximation quality than Peek Search for sufficiently large latency. In particular, for the fully connected first-order setting, i.e. Δ̃ = 1, the competitive ratio of Peek Reset is at most 1 + 4/(L − 7), which is better than the corresponding worst-case bound for Peek Search when L ≥ 50. Thus, Peek Search is better suited for applications with severe latency constraints, whereas Peek Reset may be preferred in less critical scenarios.
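All three algorithms rely on efficiently maximizing reward over an (L + 1)-step peek window. A minimal sketch of that window optimization in the first-order case (n = 1), in the spirit of the dynamic program behind Theorem 2; the dict-of-dicts reward tables and function name are our own illustrative encoding, and setting γ = 1 recovers the non-discounted maximization used by Randomized Peek Search and Peek Reset:

```python
def peek_search_step(cur, states, window, gamma):
    """One Peek Search move for n = 1: maximize
        sum_{l=0..L} gamma**l * R_l(y_l | y_{l-1})
    over paths y_0..y_L starting from `cur` (window[l][prev][nxt] gives R_l),
    and return the first state of a best discounted path. This backward
    recursion costs O(L * |K|**2) for this first-order sketch."""
    L = len(window) - 1
    # best[s] = max discounted reward obtainable from edges l+1..L given y_l = s
    best = {s: 0.0 for s in states}
    for l in range(L, 0, -1):
        best = {p: max(gamma ** l * window[l][p][s] + best[s] for s in states)
                for p in states}
    # choose the first move: immediate (undiscounted) edge plus discounted future
    return max(states, key=lambda s: window[0][cur][s] + best[s])
```

For example, with a window where state 1 pays little now but unlocks a large reward next step, the lookahead overrules the greedy one-step choice.
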
We now establish that no algorithm, whether deterministic or randomized, can provide significantly better guarantees than our algorithms under latency constraints.

6 Lower Bounds

We now state our lower bounds on the performance of any deterministic and any randomized algorithm in the general non-homogeneous ergodic Markov chain models. The proofs revolve around our novel Δ-dimensional prismatic polytope constructions, where each vertex corresponds to a state. We disentangle the interplay between L, Δ, and n (see Fig. 2 for visual intuition).
Theorem 5. The competitive ratio of any deterministic online algorithm on nth order (time-varying) Markov chain models with diameter Δ for latency L is greater than

1 + (Δ̃/L) · (1 + (Δ̃ + L − 1)/√((Δ̃ + L − 1)² + Δ̃)) .

[Figure 2 omitted: (a) Deterministic setting, a triangular prism with faces ABC and abc and translated copies A′B′C′ and a′b′c′; (b) Randomized setting.]

Figure 2: Constructions for lower bounds with Δ = 3. (Left) ABC and abc are opposite faces of a triangular prism, and A′B′C′ and a′b′c′ are their translations. The resulting prismatic polytope has the property that the distance between the farthest vertices is Δ. Different colors are used for edges on different faces, and the same color for translated faces, to aid visualization (we have also omitted some edges that connect faces to their translated faces, in order to avoid clutter). A priori, the rewards for the L + 1 steps are the same across all vertices (i.e. states). Thus, due to the symmetry of the polytope, the online algorithm arbitrarily chooses some vertex (shown here in green). The states that can be reached via shortest paths of the same length from this vertex are displayed in the same color (magenta, red, or orange). The adversary reveals the rewards for an additional, i.e.
(L + 2)th, time step such that states at distance d ∈ [Δ] from the green state would fetch (n + d − 1)α for some α, while the green state would yield 0. Under the Markov dependency rule that a state yields reward only if it has been visited n consecutive times, the online algorithm fails to obtain any reward in the (L + 2)th step regardless of the state sequence it traces. The optimal algorithm, due to prescience, gets the maximum possible reward (n + Δ − 1)α for this step. (Right) In the randomized setting, all states fetch zero reward at the final step except a randomly chosen state (shown in green) that yields reward n. The probability that the randomized online algorithm correctly guesses the green state at the initial time step is exponentially small in Δ. In all other cases, it must forgo this reward, and thus its expected reward is low compared to the optimal algorithm for large Δ.

In particular, when n = 1, Δ = 1, the ratio is larger than 1 + 1/L + 1/(L² + 1).

Theorem 6. For any ε > 0, the competitive ratio of any randomized online algorithm, that is allowed latency L, on nth order (time-varying) Markov chain models with Δ = 1 is at least 1 + (1 − ε)n/(L + εn). For a general diameter Δ, the competitive ratio is at least 1 + (2^{Δ−1}⌈1/ε⌉ − 1)n / (2^{Δ−1}⌈1/ε⌉L + n).

We now analyze the performance of our algorithms in light of these lower bounds. Note that when Δ̃ = 1, Randomized Peek Search (Theorem 3) matches the lower bound in Theorem 6 even with L = 1, since we may set ε arbitrarily close to 0. Similarly, in the deterministic setting, Peek Search achieves a competitive ratio of 4 with L = 1 (Theorem 1), which is within twice the theoretically best possible performance (i.e. a ratio of 2.5) as specified by Theorem 5.
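These closed forms are easy to sanity-check numerically; a small sketch for the fully connected first-order case (n = 1, Δ = 1), evaluating the bounds exactly as stated in Theorems 1 and 5 (function names are our own):

```python
def peek_search_upper(L, n=1, delta=1):
    """Theorem 1 upper bound for Peek Search with the optimized discount gamma:
    rho <= (L+1)/(L-delta-n+2) * ((L+1)/(delta+n-1))**((n+delta-1)/(L-delta-n+2))."""
    d = L - delta - n + 2
    return (L + 1) / d * ((L + 1) / (delta + n - 1)) ** ((n + delta - 1) / d)

def deterministic_lower(L):
    """Theorem 5 lower bound specialized to n = 1, delta = 1."""
    return 1 + 1 / L + 1 / (L ** 2 + 1)

# At L = 1 the upper bound is 4 and the lower bound is 2.5, i.e. Peek Search
# is within a factor of two of the best possible online guarantee, and the
# upper bound decreases as L grows.
```
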
Moreover, the performance approaches the lower bound with increase in L as Peek Reset takes center stage (Theorem 4). In the general setting, our algorithms are almost optimal when L is sufficiently large compared to Δ̃.
Note that we do not make any distributional assumptions on the rewards for any (L + 1)-long peek window. Thus, our algorithms accommodate various settings, including those where the rewards may be revealed in an adaptive (e.g. non-stochastic, possibly adversarial) manner.
We now proceed to our experiments, which accentuate the practical implications of our work.

[Figure 3 omitted: two panels, "Log-probability" (scaled by 10^4) and "Decoding agreement with Viterbi" (fraction of agreement), each plotted against latency L ∈ {1, 3, . . . , 19} for Viterbi, Peek Search, OSA, Randomized Peek Search, and Peek Reset.]

Figure 3: Evaluation of performance on genome sequence data. The data consists of 73385 sites, each of which is to be labeled with one of the four states. The log-probability values on the right have been scaled down by a factor of 10^4 to avoid clutter near the vertical axis. Peek Search achieves almost optimal performance with a latency of only about 20, which is over three orders of magnitude less than that of the optimal Viterbi algorithm. The corresponding predictions agreed with the Viterbi algorithm on more than 95% of all sites. Peek Reset and Randomized Peek Search also performed well, especially for larger values of L. In contrast, OSA was found to be significantly suboptimal.

7 Experiments

We describe the results of our experiments on two real datasets.
We \ufb01rst compare the performance of\nour methods with the state-of-the-art Online Step Algorithm (OSA) [24] that also provides theoretical\nguarantees for \ufb01rst order Markov decoding under latency constraints. OSA hinges on a strong\nassumption that uncertainty in any state label decreases with increase in latency. We found that\nthis assumption does not hold in the context of an important application, namely, genome decoding.\nIn contrast, since our algorithms do not make any such assumptions, they achieve much better\nperformance as expected. Furthermore, unlike our algorithms, OSA does not provide a direct control\nover the latency L. Speci\ufb01cally, OSA relies on a hyperparameter \u03bb, that may require extensive tuning,\nto achieve a good trade-off between latency and accuracy. Our empirical \ufb01ndings thus underscore\nthe promise of our algorithms toward expediting scienti\ufb01c progress in \ufb01elds like drug discovery. We\nthen demonstrate that Peek Search performs exceptionally well on the task of part-of-speech tagging\non the Brown corpus data even for L = 1. We also provide evidence that heuristics such as Beam\nSearch can be adapted to approximate optimal discounted paths ef\ufb01ciently within peek windows of\nlength (L + 1). This computational bene\ufb01t, however, comes at the expense of theoretical guarantees.\n\n7.1 Genome sequencing\n\nWe experimented with the Glycerol TraSH genome data [35] pertaining to M. tuberculosis transposon\nmutants. Our task was to label each of the 73385 gene sites with one of the four states, namely\nessential (ES), growth-defect (GD), non-essential (NE), and growth-advantage (GA). These states\nrepresent different categories of gene essentiality depending on their read-counts (i.e. 
emissions), and the labeling task is crucial toward identifying potential drug targets for antimicrobial treatment [35]. We used the parameter settings suggested by [35] for decoding with an HMM.

Note that for this problem, the Viterbi algorithm and heuristics such as beam search need to compute the optimal paths of length equal to the number of sites, i.e., in excess of 73000, thereby incurring very high latency. However, as Fig. 3 shows, Peek Search achieved near-optimal log-probability (the Viterbi objective in (1)) with a latency of only about 20, which is less than that of Viterbi by a factor in excess of 3500. Moreover, the state sequence output by Peek Search agreed with the Viterbi labels on more than 95% of the sites. We observe that, barring downward blips from L = 1 to L = 3 and from L = 9 to L = 11, the performance improved with L. As expected, for all L, including those featuring in the blips, the log-probability values were verified to be consistent with our theoretical guarantees. On the other hand, we found OSA to be significantly suboptimal in terms of both log-probability and label agreement. In particular, OSA agreed with the optimal algorithm (Viterbi) on only 58.8% of predictions under both entropy and expected classification error measures suggested in [24].

Latency | Method                            | Log-probability | Tagging accuracy (%)
--      | Viterbi                           | -117.29 +/- .53 | 97.4 +/- .02
L = 1   | Peek Search                       | -117.40 +/- .54 | 97.0 +/- .01
L = 1   | Approximate Peek Search (3 beams) | -117.40 +/- .54 | 97.0 +/- .02
L = 2   | Peek Search                       | -117.34 +/- .54 | 97.2 +/- .01
L = 2   | Approximate Peek Search (3 beams) | -117.34 +/- .54 | 97.2 +/- .01
L = 3   | Peek Search                       | -117.33 +/- .54 | 97.3 +/- .02
L = 3   | Approximate Peek Search (3 beams) | -117.33 +/- .54 | 97.3 +/- .02

Table 3: Part-of-speech tagging on Brown data.
In contrast, just with L = 1, Peek Search matched Viterbi on 77.4% of predictions, thereby outperforming OSA by a wide margin (a relative improvement of over 30%). We varied the OSA hyperparameter λ ∈ {10⁻⁴, 10⁻¹, . . . , 10⁴} under both the entropy and the expected classification error measures suggested by [24] to tune for L (as noted in [24], large values of λ penalize latency). However, the performance of OSA (as shown in Fig. 3) did not show any improvement.
Fig. 3 also shows the performance of Randomized Peek Search (averaged over 10 independent runs) and Peek Reset. Since the guarantees of Peek Reset are meaningful only for L large enough (exceeding 7), we show results with Peek Reset for L ≥ 9. Both these methods were found to be better than OSA on the genome data. Moreover, as expected, the performance of Peek Reset improved with increase in L. In particular, the scaled log-probabilities under Peek Reset for L = 50 and L = 100 were observed to be -39.69 and -39.56, respectively. Moreover, the decoded sequences agreed with Viterbi on 97.32% and 98.68% of the sites, respectively. For smaller values of L, Peek Search turned out to be better than both Peek Reset and Randomized Peek Search.
We also evaluated the performance of Beam Search. Note that despite efficient greedy path expansion, Beam Search with k beams (BS-k) has high latency (same as Viterbi), since no labels can be generated until the k greedy paths are computed for the entire sequence and backpointers are traced back to the start. We found that BS-2 performed worse than Peek Search for L ≥ 5. Also, BS-3 recorded a scaled log-probability of -39.61 and a decoding agreement of 97.73% (worse than Peek Search with L = 50).
BS-k matched the Viterbi performance for k ≥ 4.
Finally, note that instead of choosing γ optimally, one could fix γ to some other value in Peek Search. In particular, setting γ = 1 amounts to having Peek Search move one step on a path with maximum non-discounted reward during each peek window. We found that Peek Search with γ = 1 obtained a sub-optimal scaled log-probability of -45.86 (and a 58.8% decoding match with Viterbi). However, setting γ optimally did not make any difference for larger L.

7.2 Part-of-speech tagging

For our second task, we focused on the problem of decoding the part-of-speech (POS) tags for sentences in the standard Brown corpus data. The corpus comprises 57340 sentences of different lengths, with 1161192 tokens in total. The corpus is not divided into separate train and test sets. Therefore, we formed 5 random partitions, each having 80% train and 20% test sentences. The train set was used to estimate the parameters of a first-order HMM, and these parameters were then used to predict the tags for tokens in the test sentences. For each test sentence that had all its tokens observed in the train data, we computed its log-probability using its predicted tags (note that the Viterbi algorithm maximizes this quantity in (1) exactly).
We computed the average log-probability over these test sentences for both the Viterbi algorithm and Peek Search at different values of latency L. We also computed the accuracy of tag predictions, i.e., the fraction of test tokens whose predicted tags matched their ground-truth labels. We report the results (averaged over 5 independent train-test partitions) in Table 3.
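The "Approximate Peek Search (3 beams)" rows in Table 3 restrict the within-window search to a few beams. A hypothetical sketch of that restriction (all names, the discounting convention, and the interface are ours, not the authors'):

```python
import numpy as np

def beam_window_step(enter, log_A, emits, gamma=0.9, k=3):
    """Hypothetical beam-restricted window search: keep only the k
    highest-scoring states at each step of the peek window, then commit the
    first state of the best surviving path. `enter` holds per-state log scores
    for the first window position; `emits` lists per-state emission log-probs
    of the observed tokens at the remaining window positions."""
    S = enter.shape[0]
    score = enter.copy()
    first = np.arange(S)
    for step, emit in enumerate(emits, start=1):
        keep = np.argsort(score)[-k:]   # beam pruning: top-k states only
        cand = score[keep, None] + gamma ** step * (log_A[keep] + emit[None, :])
        arg = np.argmax(cand, axis=0)   # best surviving predecessor per state
        score = cand[arg, np.arange(S)]
        first = first[keep][arg]
    return int(first[np.argmax(score)])

# Toy usage: 4 states, a 3-step window beyond the entry position.
rng = np.random.default_rng(1)
enter = np.log(rng.dirichlet(np.ones(4)))
log_A = np.log(rng.dirichlet(np.ones(4), size=4))
emits = [np.log(rng.dirichlet(np.ones(4))) for _ in range(3)]
print(beam_window_step(enter, log_A, emits, k=3))
```

With k at least the number of states, the pruning is vacuous and the step reduces to the exact windowed search; smaller k trades accuracy for per-step work.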
We observed that Peek Search nearly matched the performance of Viterbi.² Moreover, similar results were obtained when we used 3 beams to approximate the optimal γ-discounted reward within each (L + 1)-long peek window. Thus, we can potentially design fast yet accurate heuristics for some low-latency settings.

² We found that OSA also achieved almost optimal performance on the Brown corpus.

References

[1] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Information Theory, 13(2):260–269, 1967.

[2] O. Gotoh. Modeling one thousand intron length distributions with fitild. Bioinformatics, 34(19):3258–3264, 2018.

[3] W. Chu, Z. Ghahramani, and D. L. Wild. A graphical model for protein secondary structure prediction. In International Conference on Machine Learning (ICML), 2004.

[4] K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn. Scalable modified Kneser-Ney language model estimation. In Association for Computational Linguistics (ACL), pages 690–696, 2013.

[5] S. Bengio. An asynchronous hidden Markov model for audio-visual speech recognition. In Neural Information Processing Systems (NIPS), pages 1237–1244, 2003.

[6] J. Bulla and I. Bulla. Stylized facts of financial time series and hidden semi-Markov models. Comput. Stat. Data Anal., 51(4), 2006.

[7] S. B. Wang, A. Quattoni, L.-P. Morency, and D. Demirdjian. Hidden conditional random fields for gesture recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1521–1527, 2006.

[8] P. F. Felzenszwalb, D. P. Huttenlocher, and J. M. Kleinberg. Fast algorithms for large-state-space HMMs with applications to web usage analysis. In Neural Information Processing Systems (NIPS), pages 409–416, 2003.

[9] A. Thiagarajan, L. Ravindranath, H. Balakrishnan, S. Madden, and L. Girod.
Accurate, low-energy trajectory mapping for mobile devices. In Networked Systems Design and Implementation (NSDI), pages 267–280, 2011.

[10] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In International Conference on Machine Learning (ICML), 2003.

[11] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pages 267–296, 1990.

[12] R. Ocana-Riola. Non-homogeneous Markov processes for biomedical data analysis. Biometrical Journal, 47:369–376, 2005.

[13] R. Perez-Ocon, J. E. Ruiz-Castro, and M. L. Gamiz-Perez. Non-homogeneous Markov models in the analysis of survival after breast cancer. Journal of the Royal Statistical Society, Series C, 50:111–124, 2001.

[14] B. Chen and X.-H. Zhou. Non-homogeneous Markov process models with informative observations with an application to Alzheimer's disease. Biometrical Journal, 53(3):444–463, 2011.

[15] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Neural Information Processing Systems (NIPS), pages 25–32, 2003.

[16] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning (ICML), 2004.

[17] A. McCallum, D. Freitag, and F. C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In International Conference on Machine Learning (ICML), 2000.

[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), pages 282–289, 2001.

[19] J. Peng, L. Bo, and J. Xu. Conditional neural fields. In Neural Information Processing Systems (NIPS), pages 1419–1427, 2009.

[20] C. Gupta, A. S.
Suggala, A. Goyal, H. V. Simhadri, B. Paranjape, A. Kumar, S. Goyal, R. Udupa, M. Varma, and P. Jain. ProtoNN: Compressed and accurate kNN for resource-scarce devices. In International Conference on Machine Learning (ICML), pages 1331–1340, 2017.

[21] A. Kumar, S. Goyal, and M. Varma. Resource-efficient machine learning in 2 KB RAM for the Internet of Things. In International Conference on Machine Learning (ICML), pages 1935–1944, 2017.

[22] V. K. Garg, O. Dekel, and L. Xiao. Learning small predictors. In Neural Information Processing Systems (NeurIPS), 2018.

[23] P. Zhu, D. A. E. Acar, N. Feng, P. Jain, and V. Saligrama. Cost-aware inference for IoT devices. In Artificial Intelligence and Statistics (AISTATS), pages 2770–2779, 2019.

[24] M. Narasimhan, P. Viola, and M. Shilman. Online decoding of Markov models under latency constraints. In International Conference on Machine Learning (ICML), pages 657–664, 2006.

[25] A. Backurs and C. Tzamos. Improving Viterbi is hard: Better runtimes imply faster clique algorithms. In International Conference on Machine Learning (ICML), volume 70, pages 311–321, 2017.

[26] M. Cairo, G. Farina, and R. Rizzi. Decoding hidden Markov models faster than Viterbi via online matrix-vector (max, +)-multiplication. In AAAI Conference on Artificial Intelligence (AAAI), pages 1484–1490, 2016.

[27] R. Šrámek, B. Brejová, and T. Vinař. On-line Viterbi algorithm for analysis of long biological sequences. In Algorithms in Bioinformatics, pages 240–251. Springer Berlin Heidelberg, 2007.

[28] A. Churbanov and S. Winters-Hilt. Implementing EM and Viterbi algorithms for hidden Markov model in linear memory. BMC Bioinformatics, 9(1):224, 2008.

[29] Y. Lifshits, S. Mozes, O. Weimann, and M. Ziv-Ukelson. Speeding up HMM decoding and training by exploiting sequence repetitions. Algorithmica, 54(3):379–399, 2009.

[30] N.
Kaji, Y. Fujiwara, N. Yoshinaga, and M. Kitsuregawa. Efficient staggered decoding for sequence labeling. In Association for Computational Linguistics (ACL), pages 485–494, 2010.

[31] H. Daumé III and D. Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. In International Conference on Machine Learning (ICML), pages 169–176, 2005.

[32] J. Bloit and X. Rodet. Short-time Viterbi for online HMM decoding: Evaluation on a real-time phone recognition task. In ICASSP, pages 2121–2124. IEEE, 2008.

[33] C. Y. Goh, J. Dauwels, N. Mitrovic, M. T. Asif, A. Oran, and P. Jaillet. Online map-matching based on hidden Markov model for real-time traffic sensing applications. In IEEE Conference on Intelligent Transportation Systems, 2012.

[34] T. S. Jayram, T. Kimbrel, R. Krauthgamer, B. Schieber, and M. Sviridenko. Online server allocation in a server farm via benefit task systems. In Symposium on Theory of Computing (STOC), pages 540–549, 2001.

[35] M. A. DeJesus and T. R. Ioerger. A hidden Markov model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data. BMC Bioinformatics, 2013.