{"title": "Selective Prediction of Financial Trends with Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 855, "page_last": 863, "abstract": "Focusing on short term trend prediction in a financial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. We examine two types of selective mechanisms for HMM predictors. The first is a rejection in the spirit of Chow\u2019s well-known ambiguity principle. The second is a specialized mechanism for HMMs that identifies low quality HMM states and abstains from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade off prediction coverage to gain better accuracy in a controlled manner. We compare performance of the ambiguity-based rejection technique with that of the sHMM approach. Our results indicate that both methods are effective, and that the sHMM model is superior.", "full_text": "Selective Prediction of Financial Trends with Hidden Markov Models\n\nRan El-Yaniv and Dmitry Pidan\n\nDepartment of Computer Science, Technion\n\nHaifa, 32000 Israel\n\n{rani,pidan}@cs.technion.ac.il\n\nAbstract\n\nFocusing on short term trend prediction in a financial context, we consider the\nproblem of selective prediction whereby the predictor can abstain from prediction\nin order to improve performance. We examine two types of selective mechanisms\nfor HMM predictors. The first is a rejection in the spirit of Chow\u2019s well-known\nambiguity principle. The second is a specialized mechanism for HMMs that identifies\nlow quality HMM states and abstains from prediction in those states. We\ncall this model selective HMM (sHMM). In both approaches we can trade off prediction\ncoverage to gain better accuracy in a controlled manner. We compare\nperformance of the ambiguity-based rejection technique with that of the sHMM\napproach. 
Our results indicate that both methods are effective, and that the sHMM\nmodel is superior.\n\n1 Introduction\n\nSelective prediction is the study of predictive models that can automatically qualify their own predictions\nand output \u201cdon\u2019t know\u201d when they are not sufficiently confident. Currently, manifestations\nof selective prediction within machine learning mainly exist in the realm of inductive classification,\nwhere this notion is often termed \u2018classification with a reject option.\u2019 In the study of a reject option,\nwhich was initiated more than 40 years ago by Chow [5], the goal is to enhance accuracy (or reduce\n\u2018risk\u2019) by compromising the coverage. For a classifier or predictor equipped with a rejection mechanism\nwe can quantify its performance profile by evaluating its risk-coverage (RC) curve, giving\nthe functional relation between error and coverage. The RC curve represents a trade-off: the more\ncoverage we compromise, the more accurate we can expect to be, up to the point where we reject\neverything and (trivially) never err. The essence of selective classification is to construct classifiers\nachieving useful (and optimal) RC trade-offs, thus providing the user with control over the choice\nof desired risk (with its associated coverage compromise).\n\nOur longer term goal is to study selective prediction models for general sequential prediction tasks.\nWhile this topic has only been sparsely considered in the literature, we believe that it has great potential\nin dealing with difficult problems. As a starting point, however, in this paper we focus on the\nrestricted objective of predicting next-day trends in financial sequences. While limited in scope, this\nproblem serves as a good representative of difficult sequential data [17]. A very convenient and quite\nversatile modeling technique for analyzing sequences is the Hidden Markov Model (HMM). 
Therefore,\nthe goal we set was to introduce selection mechanisms for HMMs, capable of achieving\na useful risk-coverage trade-off in predicting next-day trends.\n\nTo this end we examined two approaches. The first is a straightforward application of Chow\u2019s\nambiguity principle implemented with HMMs. The second is a novel and specialized technique\nutilizing the HMM state structure. In this approach we identify latent states whose prediction quality\nis systematically inferior, and abstain from predictions while the underlying source is likely to be in\nthose states. We call this model selective HMM (sHMM). While this natural approach can work in\nprinciple, if the HMM does not contain sufficiently many \u201cfine grained\u201d states, whose probabilistic\nvolume (or \u201cvisit rate\u201d) is small, the resulting risk-coverage trade-off curve will be a coarse step\nfunction that will prevent fine control and usability. One of our contributions is a solution to this\ncoarseness problem by introducing algorithms for refining sHMMs. The resulting refined sHMMs\ngive rise to smooth RC trade-off curves.\n\nWe present the results of a quite extensive empirical study showing the effectiveness of our methods,\nwhich can increase the edge in predicting next-day trends. We also show the advantage of sHMMs\nover the classical Chow approach.\n\n2 Preliminaries\n\n2.1 Hidden Markov Models in brief\n\nA Hidden Markov Model (HMM) is a generative probabilistic state machine with latent states, in\nwhich state transitions and observation emissions represent first-order Markov processes. Given an\nobservation sequence, $O = O_1, \\ldots, O_T$, hypothesized to be generated by such a model, we would\nlike to \u201creverse engineer\u201d the most likely (in a Bayesian sense) state machine giving rise to $O$, with\nassociated latent state sequence $S = S_1, \\ldots, S_T$. 
An HMM is defined as $\\lambda \\triangleq \\langle Q, M, \\pi, A, B \\rangle$,\nwhere $Q$ is a set of states, $M$ is the number of distinct observation symbols, $\\pi$ is the initial state distribution, $\\pi_i \\triangleq P[S_1 = q_i]$,\n$A = (a_{ij})$ is the transition matrix, $a_{ij} \\triangleq P[S_{t+1} = q_j \\mid S_t = q_i]$, and $B = (b_j(k))$ is\nthe observation emission matrix, $b_j(k) \\triangleq P[O_t = v_k \\mid S_t = q_j]$.\n\nGiven an HMM $\\lambda$ and observation sequence $O$, an efficient algorithm for calculating $P[O \\mid \\lambda]$ is\nthe forward-backward procedure (see details in, e.g., Rabiner [16]). The estimation of the HMM\nparameters (training) is traditionally performed using a specialized expectation-maximization (EM)\nalgorithm called the Baum-Welch algorithm [2]. For a large variety of problems it is also essential to\nidentify the \u201cmost likely\u201d state sequence associated with a given observation sequence. This is commonly\naccomplished using the Viterbi algorithm [22], which computes $\\arg\\max_S P[S \\mid O, \\lambda]$. Similarly,\none can identify the most likely \u201cindividual\u201d state, $\\arg\\max_q P[S_t = q \\mid O, \\lambda]$, corresponding\nto time $t$.\n\n2.2 Selective Prediction and the RC Trade-off\n\nTo define the performance parameters in selective prediction we utilize the following definitions for\nselective classifiers from [6, 7]. A selective (binary) classifier is represented as a pair of functions\n$\\langle f, g \\rangle$, where $f$ is a binary classifier and $g : X \\to \\{0, 1\\}$ is a binary qualifier for $f$: whenever $g(x) = 1$,\nthe prediction $f(x)$ is accepted, and otherwise it is ignored. The performance of a selective\nclassifier is measured by its coverage and risk. 
Coverage is the expected volume of non-rejected\ndata instances, $C \\triangleq E[g(X)]$ (where the expectation is w.r.t. the unknown underlying distribution),\nand the risk is the error rate over non-rejected instances, $R \\triangleq E[I(f(X) \\neq Y)\\, g(X)] / C$, where $Y$\nrepresents the true classification.\n\nThe purpose of a selective prediction model is to provide \u201csufficiently low\u201d risk with \u201csufficiently\nhigh\u201d coverage. The functional relation between risk and coverage is called the risk coverage (RC)\ntrade-off. Generally, the user of a selective model would like to bound one measure (either risk or\ncoverage) and then obtain the best model in terms of the other measure. The RC curve of a given\nmodel characterizes this trade-off on a risk/coverage plane thus describing its full spectrum.\n\nA selective predictor is useful if its RC curve is \u201cnon trivial\u201d in the sense that progressively smaller\nrisk can be obtained with progressively smaller coverage. Thus, when constructing a selective classification\nor a prediction model it is imperative to examine its RC curve. One can consider theoretical\nbounds of the RC curve (as in [6]) or empirical ones as we do here. An interpolated RC curve can be\nobtained by selecting a number of coverage bounds at certain grid points of choice, and learning\n(and testing) a selective model aiming at achieving the best possible risk for each coverage level.\nObviously, each such model should respect the corresponding coverage bound.\n\n3 Selective Prediction with HMMs\n\n3.1 Ambiguity Model\n\nThe first approach we consider is an implementation of the classical ambiguity idea. We construct an\nHMM-based classifier, similar to the one used in [3], and endow it with a rejection mechanism in the\nspirit of Chow [5]. This approach is limited to binary labeled observation sequences. 
The training\nset, consisting of labeled sequences, is partitioned into its positive and negative instances, and two\nHMMs, $\\lambda^+$ and $\\lambda^-$, are trained using those sets, respectively. Thus, $\\lambda^+$ is trained to identify\npositively labeled sequences, and $\\lambda^-$ to identify negatively labeled sequences. Then, each new observation\nsequence $O$ is classified as $\\mathrm{sign}(P[O \\mid \\lambda^+] - P[O \\mid \\lambda^-])$.\n\nFor applying Chow\u2019s ambiguity idea using the model $(\\lambda^+, \\lambda^-)$, we need to define a measure $C(O)$\nof prediction confidence for any observation sequence $O$. A natural choice in this context is to\nmeasure the log-likelihood difference between the positive and negative models, normalized by the\nlength of the sequence. Thus, we define $C(O) \\triangleq \\left| \\frac{1}{T} \\left( \\log P[O \\mid \\lambda^+] - \\log P[O \\mid \\lambda^-] \\right) \\right|$, where $T$ is\nthe length of $O$. The greater $C(O)$ is, the more confident we are in the classification of $O$. Now,\ngiven the classification confidences of all sequences in the training data set, and given a required\nlower bound on the coverage, an empirical threshold can be found such that a designated number of\ninstances with the smallest confidence measures will be rejected. If our data is non-stationary (e.g.,\nfinancial sequences), this threshold can be re-estimated at the arrival of every new data instance.\n\n3.2 State-Based Selectivity\n\nWe propose a different approach for implementing selective prediction with HMMs. The idea is to\ndesignate an appropriate subset of the states as \u201crejective.\u201d The proposed approach is suitable for\nprediction problems whose observation sequences are labeled. Specifically, for each observation,\n$O_t$, we assume that there is a corresponding label $l_t$. The goal is to predict $l_t$ at time $t - 1$.\n\nEach state is assigned risk and visit rate estimates. 
For each state $q$, its risk estimate is used as\na proxy to the probability of making erroneous predictions from $q$, and its visit rate quantifies the\nprobability of outputting any symbol from $q$. A subset of the highest risk states is selected so that\ntheir total expected visit rate does not exceed the user specified rejection bound. These states are\ncalled rejective and predictions from them are ignored. The following two definitions formulate\nthese notions. We associate with each state $q$ a label $L_q$ representing the HMM prediction while at\nthis state (see Section 3.4). Denote $\\gamma_t(i) \\triangleq P[S_t = q_i \\mid O, \\lambda]$, and note that $\\gamma_t(i)$ can be efficiently\ncalculated using the standard forward-backward procedure (see Rabiner [16]).\n\nDefinition 3.1 (empirical visit rate). Given an observation sequence, the empirical visit rate, $v(i)$,\nof a state $q_i$, is the fraction of time the HMM spends in state $q_i$, that is $v(i) \\triangleq \\frac{1}{T} \\sum_{t=1}^{T} \\gamma_t(i)$.\n\nDefinition 3.2 (empirical state risk). Given an observation sequence, the empirical risk, $r(i)$, of a\nstate $q_i$, is the rate of erroneous visits to $q_i$, that is $r(i) \\triangleq \\frac{1}{v(i) T} \\sum_{t=1,\\, L_{q_i} \\neq l_t}^{T} \\gamma_t(i)$.\n\nSuppose we are required to meet a user specified rejection bound $0 \\le B \\le 1$. This means that we\nare required to emit predictions (rather than \u201cdon\u2019t know\u201ds) in at least a $1 - B$ fraction of the time. To\nachieve this we apply the following greedy selection procedure of rejective states whereby highest\nrisk states are sequentially selected as long as their overall visit rate does not exceed $B$. We call the\nresulting model Naive-sHMM. Formally, let $q_{i_1}, q_{i_2}, \\ldots, q_{i_N}$ be an ordering of all states, such that\nfor each $j < k$, $r(i_j) \\ge r(i_k)$. Then, the rejective state subset is\n\n$$RS \\triangleq \\left\\{ q_{i_1}, \\ldots, q_{i_K} \\;\\middle|\\; \\sum_{j=1}^{K} v(i_j) \\le B,\\; \\sum_{j=1}^{K+1} v(i_j) > B \\right\\}. \\tag{3.1}$$\n\n3.3 Overcoming Coarseness\n\n
The above simple approach suffers from the following coarseness problem. If our model does not\ninclude a large number of states, or includes states with very high visit rates (as is often the case\nin applications), the total visit rate of the rejective states might be far from the requested bound\n$B$, entailing that selectivity cannot be fully exploited. For example, consider a model that has\nthree states such that $r(q_1) > r(q_2) > r(q_3)$, $v(q_1) = \\epsilon$, and $v(q_2) = B + \\epsilon$. In this case only\nthe negligibly visited $q_1$ will be rejected. We propose two methods to overcome this coarseness\nproblem. These methods are presented in the two subsequent sections.\n\n3.3.1 Randomized Linear Interpolation (RLI)\n\nIn the randomized linear interpolation (RLI) method, predictions from rejective states are always\nrejected, but predictions from the non-rejective state with the highest risk rate are rejected with\nappropriate probability, such that the total expected rejection rate equals the rejection bound $B$. Let\n$q$ be the non-rejective state with the highest risk rate. The probability of rejecting predictions emerging\nfrom this state is taken to be $p_q \\triangleq \\frac{1}{v(q)} \\left( B - \\sum_{q' \\in RS} v(q') \\right)$. Clearly, with $p_q$ thus defined, the\ntotal expected rejection rate is precisely $B$, when expectation is taken over random choices.\n\n3.3.2 Recursive Refinement (RR)\n\nGiven an initial HMM model, the idea in the recursive refinement approach is to construct an approximate\nHMM whose states have finer granularity of visit rates. This smaller granularity enables\na selection of rejective states whose total visit rate is closer to the required bound. 
The refinement is\nachieved by replacing every highly visited state with a complete HMM.\n\nThe process starts with a root HMM, $\\lambda_0$, trained in a standard way using the Baum-Welch algorithm.\nIn $\\lambda_0$, states that have visit rate greater than a certain bound are identified. For each such state $q_i$\n(called a heavy state), a new HMM $\\lambda_i$ (called a refining HMM) is trained and combined with $\\lambda_0$ as\nfollows: every transition from other states into $q_i$ in $\\lambda_0$ entails a transition into an (initial) state in $\\lambda_i$ in\naccordance with the initial state distribution of $\\lambda_i$; every self transition to $q_i$ in $\\lambda_0$ results in a state\ntransition in $\\lambda_i$ according to its state transition matrix; finally, every transition from $q_i$ to another\nstate entails a transition from a state in $\\lambda_i$ whose probability is the original transition probability from\n$q_i$. States of $\\lambda_i$ are assigned the label of $q_i$. This refinement continues in a recursive manner and\nterminates when all the heavy states have refinements. The non-refined states are called leaf states.\n\nFigure 1: Recursively Refined HMM\n\nFigure 1 depicts a recursively refined HMM having two refinement levels. In this model, states 1,2,4\nare heavy (and refined) states, and states 3,5,6,7,8 are leaf (emitting) states. The model consisting\nof states 3 and 4 refines state 1, the model consisting of states 5 and 6 refines state 2, etc.\n\nAn aggregate state of the complete hierarchical model corresponds to a set of inner HMM states,\neach of which is a state on a path from the root through refining HMMs, to a leaf state. Only leaf\nstates actually emit symbols. 
Refined states are non-emitting and their role in this construction is to\npreserve the structure (and transitions) of the HMMs they refine.\n\nAt every time instance $t$, the model is at some aggregate state. Transition to the next aggregate state\nalways starts at $\\lambda_0$, and recursively progresses to the leaf states, as shown in the following example.\nSuppose that the model in Figure 1 is at aggregate state {1,4,7} at time $t$. The aggregate state at time\n$t + 1$ is calculated as follows. $\\lambda_0$ is in state 1, so its next state (say 1 again) is chosen according to\nthe distribution $\\{a_{11}, a_{12}\\}$. We then consider the model that refines state 1, which was in state 4 at\ntime $t$. Here again the next state (say 3) is chosen according to the distribution $\\{a_{43}, a_{44}\\}$. State 3\nis a leaf state that emits observations, and the aggregate state at time $t + 1$ is {1,3}. On the other\nhand, if state 2 is chosen at the root, a new state (say 6) in its refining model is chosen according to\nthe initial distribution $\\{\\pi_5, \\pi_6\\}$ (transition into the heavy state from another state). The chosen state\n6 is a leaf state so the new aggregate state becomes {2,6}.\n\nAlgorithm 1 TrainRefiningHMM\nInput: HMM $\\lambda = \\langle \\{q_j\\}_{j=1}^{n}, M, \\pi, A, B \\rangle$, heavy state $q_i$, $O$\n1: Draw random HMM, $\\lambda_i = \\langle \\{q_j\\}_{j=n+1}^{n+N}, M, \\{\\pi_j\\}_{j=n+1}^{n+N}, \\{a_{jk}\\}_{j,k=n+1}^{n+N}, \\{b_{jm}\\}_{j=n+1,\\,m=1}^{n+N,\\,M} \\rangle$\n2: For each $1 \\le j \\le n$, $j \\neq i$, replace transition $q_j q_i$ with $q_j q_{n+1}, \\ldots, q_j q_{n+N}$, and $q_i q_j$ with $q_{n+1} q_j, \\ldots, q_{n+N} q_j$\n3: Remove state $q_i$ with the corresponding $\\{b_{im}\\}_{m=1}^{M}$ from $\\lambda$, and record it as a state refined by $\\lambda_i$. 
Set $L_{q_j} = L_{q_i}$ for each $n+1 \\le j \\le n+N$\n4: while not converged do\n5:   For each $1 \\le j \\le n$, $j \\neq i$, update $a_{j(n+k)} = a_{ji} \\pi_{n+k}$, and $a_{(n+k)j} = a_{ij}$, $1 \\le k \\le N$\n6:   For each $n+1 \\le j \\le n+N$, update $\\pi_j = \\pi_i \\pi_j$\n7:   For each $n+1 \\le j, k \\le n+N$, update $a_{jk} = a_{ii} a_{jk}$\n8:   Re-estimate $\\{\\pi_j\\}_{j=n+1}^{n+N}$, $\\{a_{jk}\\}_{j,k=n+1}^{n+N}$, $\\{b_{jm}\\}_{j=n+1,\\,m=1}^{n+N,\\,M}$, using Eq. (3.2)\n9: end while\n10: Perform steps 5-7\nOutput: HMM $\\lambda$\n\nAlgorithm 1 gives pseudocode for the training algorithm for the refining HMM $\\lambda_i$ of a heavy state $q_i$.\nThis algorithm is an extension of the Baum-Welch algorithm [2]. In steps 1-3, a random $\\lambda_i$ is\ngenerated and connected to the HMM $\\lambda$ in place of $q_i$. Steps 5-8 iteratively update the parameters\nof $\\lambda_i$ until the Baum-Welch convergence criterion is met, and in step 10, $\\lambda$ is updated with the final\n$\\lambda_i$ parameters. Finally, in step 3 $q_i$ is stored as a state refined by $\\lambda_i$, to preserve the hierarchical\nstructure of the resulting model (essential for the selection mechanism). The algorithm is applied on\nheavy states until all states in the HMM have visit rates lower than a required bound.\n\n$$\\pi_j = \\frac{1}{Z} \\left( \\gamma_1(j) + \\sum_{t=1}^{T-1} \\sum_{k=1,\\, k \\neq i}^{n} \\xi_t(k, j) \\right), \\quad a_{jk} = \\frac{\\sum_{t=1}^{T-1} \\xi_t(j, k)}{\\sum_{l=n+1}^{n+N} \\sum_{t=1}^{T-1} \\xi_t(j, l)}, \\quad b_{jm} = \\frac{\\sum_{t=1,\\, O_t = m}^{T} \\gamma_t(j)}{\\sum_{t=1}^{T} \\gamma_t(j)} \\tag{3.2}$$\n\nIn Eq. (3.2), re-estimation formulas for the parameters of newly added states (Step 8) are presented,\nwhere $\\xi_t(j, k) = P[S_t = q_j, S_{t+1} = q_k \\mid O, \\lambda]$. It is easy to see that, similarly to the original Baum-Welch\nformulas, constraints for the parameters to be valid distributions are preserved ($Z$ is a normalization\nfactor in the $\\pi_j$ equation). 
The main difference from the original formulas is in the re-estimation of\n$\\pi_j$: in the refinement process, transitions from other states into heavy state $q_i$ also affect the initial\ndistribution of its refining states.\n\nThe most likely aggregate state at time $t$, given sequence $O$, is found in a top-down manner using\nthe hierarchical structure of the model. Starting with the root model, $\\lambda_0$, the most likely individual\nstate in it, say $q_i$, is identified. If this state has no refinement, then we are done. Otherwise, the most\nlikely individual state in $\\lambda_i$ (the HMM that refines $q_i$), say $q_j$, is identified, and the aggregate state is\nupdated to be $\\{q_i, q_j\\}$. The process continues until the last discovered state has no refinement.\n\nThe above procedure requires calculation of the quantity $\\gamma_t(i)$ not only for the leaf states (where it is\ncalculated using a standard forward-backward procedure), but also for the refined states. For those\nstates, $\\gamma_t(i) = \\sum_{j=1,\\, q_j \\text{ refines } q_i}^{N} \\gamma_t(j)$ is calculated recursively over the hierarchical structure.\n\nThe rejection subset is found using Eq. (3.1), applied to the aggregate states of the refined model.\nVisit and risk estimates for the aggregate state $\\{q_{i_1}, \\ldots, q_{i_k}\\}$ are calculated using $\\gamma_t(i_k)$ of the leaf state\n$q_{i_k}$ that identifies this aggregate state.\n\nThe outcome of the RR procedure is a tree of HMMs whose main purpose is to redistribute visit rates\namong states. This re-distribution is the key element that allows for achieving smooth RC curves.\nVarious other hierarchical HMM schemes have been proposed in the literature [4, 8, 10, 18, 20].\nWhile some of these schemes may appear similar to ours at first glance, they do not address the visit\nrate re-distribution objective. 
In fact, those models were developed to serve other purposes such as\nbetter modeling of sequences that have special structure (e.g., sequences hypothesized to have emerged\nfrom a hierarchical generative model).\n\n3.4 State Labeling\n\nIt remains to address the assignment of labels to the states in our state-based selection models. Labels\ncan be assigned to states a priori, and then a supervised EM method can be used for training (this\nmodel is known as Class HMM), as in [15]. Alternatively, state labels can be calculated from the\nstatistics of the states, if an unsupervised training method is used. In our setting, we follow\nthe latter approach. For a state $q_i$, and given observation label $l$, we calculate the average number of\nvisits (at $q_i$) whose corresponding label is $l$, as $E[S_t = q_i \\mid l_t = l, O, \\lambda] = \\sum_{1 \\le t \\le T,\\, l_t = l} \\gamma_t(i)$. Thus,\n$L_{q_i}$ is chosen to be an $l$ that maximizes this quantity.\n\n4 Experimental Results\n\nWe compared empirically the four selection mechanisms presented in Section 3, namely, the ambiguity\nmodel and the Naive, RLI, and RR sHMMs. All methods were compared on a next-day trend\nprediction of the S&P500 index. This problem is known to be very difficult, and recent experimental\nwork by Rao and Hong [17] assessed that although HMMs succeed in achieving some positive edge,\nthe accuracy is near fifty-fifty (51.72%) when pure price data is used.\n\nFor our prediction task, we took as the observation sequence the directions of the S&P500 price changes.\nSpecifically, the direction $d_t$, at time $t$, is $d_t \\triangleq \\mathrm{sign}(p_{t+1} - p_t)$, where $p_t$ are close prices. The state-based\nmodels were fed with the series of partial sequences $o_t \\triangleq d_{t-\\ell+1}, \\ldots, d_t$. For the ambiguity\nmodel, the partial sequences $d_{t-\\ell+1}, \\ldots, d_t$ were used as a pool of observation sequences.\n\nIn a preliminary small experiment we observed the advantage of the state-based approach over the\nambiguity model. 
In order to validate this, we tried to falsify this hypothesis by optimizing the\nhyper-parameters of the ambiguity model in hindsight.\n\nFor the state-based models we used a 5-state HMM, and predictions were made using the label of\nthe most likely individual state. Such HMMs are hypothesized to be sufficiently expressive to model\na small number of basic market conditions such as strong/weak trends (up and down) and sideways\nmarkets [17, 23]. We have not tried to optimize this basic architecture and better results with more\nexpressive models can possibly be achieved. For the ambiguity model we constructed two 8-state\nHMMs, where the length of a single observation sequence ($\\ell$) is 5. This architecture was optimized\nin hindsight among all possibilities of up to 10 states, and up to length 8, for a single observation\nsequence. Every refining model in the RR procedure had the same structure, and the upper bound\non the visit rate was fixed at 0.1. For sHMMs, the hyper-parameter $\\ell$ was arbitrarily set to 3 (there\nis possibly room for further improvement by optimizing the model w.r.t. this hyper-parameter).\n\nRC curves were computed for each technique by taking the linear grid of rejection rate bounds from\n0 to 0.9 in steps of 0.1. For each bound, every model was trained and tested using 30-fold cross-validation,\nwith each fold consisting of 10 random restarts. Test performance was measured by\nmean error rate, taken over the 30 folds, and standard error of the mean (SEM) statistics were also\ncalculated to monitor statistical significance.\n\nSince the price sequences we deal with are highly non-stationary, we employed a walk-forward\nscheme in which the model is trained over the window of past $W_p$ returns and then tested on the\nsubsequent window of $W_f$ \u201cfuture\u201d returns. 
Then, we \u201cwalk forward\u201d $W_f$ steps (days) in the return\nsequence (so that the next training segment ends where the last test segment ended) and the process\nrepeats until we consume the entire data sequence. In the experiments we set $W_p = 2000$ and\n$W_f = 50$ (that is, in each step we learn to predict the next business quarter, day by day). The data\nsequence in this experiment consisted of the 3000 S&P500 returns from 1/27/1999 to 12/31/2010.\nWith our walk forward procedure, the first 2000 points were only used for training the first model.\n\n[Figure 2a plot: Error Rate (y-axis, 0.4 to 0.5) vs. Coverage Bound (x-axis, 1 down to 0.1); curves 1. Ambiguity, 2. Naive, 3. RLI, 4. RR; also marked: V-STACKS, Spectral Algorithm]\n\n(a) Error rate vs coverage bound\n\nBound  Amb.   Naive  RLI    RR\n0.9    0.889  0.999  0.899  0.942\n0.8    0.796  0.939  0.798  0.842\n0.7    0.709  0.778  0.696  0.735\n0.6    0.616  0.719  0.593  0.628\n0.5    0.516  0.633  0.491  0.526\n0.4    0.440  0.507  0.391  0.423\n0.3    0.337  0.385  0.291  0.324\n0.2    0.256  0.305  0.192  0.224\n0.1    0.168  0.199  0.094  0.131\n\n(b) Coverage rate vs coverage bound\n\nFigure 2: S&P500 RC-curves\n\nFigure 2a shows that all four methods exhibited meaningful RC-curves; namely, the error rates\ndecreased monotonically with decreasing coverage bounds. The RLI and RR models (curves 3 and 4,\nrespectively) outperformed the Naive one (curve 2), by better exploiting the allotted coverage bound,\nas is evident from Table 2b. In addition, the RR model outperformed the RLI model, and moreover,\nits effective coverage is higher for every required coverage bound. This validates the effectiveness\nof the RR approach that implements a smarter selection process than the RLI model. Specifically,\nwhen RR refines a state and the resulting sub-states have different risk rates, the selection procedure\nwill tend to reject riskier states first. 
Comparing the state-based models (curves 2-4) to the ambiguity\nmodel (curve 1), we see that all the state-based models outperformed the ambiguity model through\nthe entire coverage range (despite the advantage we provided to the ambiguity model).\n\nWe also compared our models to two alternative HMM learning methods that were recently proposed:\nthe spectral algorithm of Hsu et al. [13], and the V-STACKS algorithm of Siddiqi et al. [20].\nAs can be seen in Figure 2a, the selective techniques can also improve the accuracy obtained by\nthese methods (with full coverage).\n\nQuantitatively very similar results were also obtained in a number of other experiments (not presented,\ndue to lack of space) with continuous data (without discretization) of the S&P500 index and\nof Gold, represented by its GLD exchange traded fund (ETF) replica.\n\n[Figure 3 plots: histograms of Number of instances (y-axis) vs. Difference (x-axis); panels (a) Visit and (b) Risk]\n\nFigure 3: Distributions of visit and risk train/test differences\n\nFigure 3a depicts the distribution of differences between empirical visit rates, measured on the training\nset, and those rates on the test set. It is evident that this distribution is symmetric and concentrated\naround zero. This means that our empirical visit estimates are quite robust and useful.\nFigure 3b depicts a similar distribution, but now for state risks. Unfortunately, here the distribution\nis much less concentrated, which means that our naive empirical risk estimates are rather noisy.\nWhile the distribution is symmetric about zero (and underestimates are often compensated by overestimates)\nit indicates that these noisy measurements are a major bottleneck in achieving better error\nrates. 
Therefore, it would be very interesting to consider more sophisticated risk estimation methods.\n\n5 Related Work\n\nSelective classification was introduced by Chow [5], who took a Bayesian route to infer the optimal\nrejection rule and analyze the risk-coverage trade-off under complete knowledge of the underlying\nprobabilistic source. Chow\u2019s Bayes-optimal policy is to reject instances whenever none of the a posteriori\nprobabilities are sufficiently predominant. While this policy cannot be explicitly applied in\nagnostic settings, it marked a general ambiguity-based approach for rejection strategies. There is\na substantial volume of research contributions on selective classification where the main theme is\nthe implementation of reject mechanisms for particular classifier learning algorithms like support\nvector machines, see, e.g., [21]. Most of these mechanisms can be viewed as variations of the Chow\nambiguity-based policy. The general consensus is that selective classification can often provide\nsubstantial error reductions and therefore rejection techniques have found good use in numerous\napplications, see, e.g., [12]. Rejection mechanisms were also utilized in [14] as a post-processing\noutput verifier for HMM-based recognition systems. There have also been a few theoretical studies\nproviding worst case high probability bounds on the risk-coverage trade-off; see, e.g., [1, 6, 7, 9].\n\nHMMs have been extensively studied and used both theoretically and in numerous application areas.\nIn particular, financial modeling with HMMs has been considered since their introduction by Baum\net al. While a complete survey is clearly beyond our scope here, we mention a few related results.\nHamilton [11] introduced a regime-switching model, in which the sequence is hypothesized to be\ngenerated by a number of hidden sources, or regimes, whose switching process is modeled by a (first-order)\nMarkov chain. 
Later, in [19] a hidden Markov model of neural network \u201cexperts\u201d was used for\nprediction of half-hour and daily price changes of the S&P500 index. Zhang [23] applied this model\nfor predicting S&P500 next-day trends, employing a mixture of Gaussians in the states. The latter two\nworks reported on prominent results in terms of cumulative profit. The recent experimental work by\nRao and Hong [17] evaluated HMMs for a next-day trend prediction task and measured performance\nin terms of accuracy. They reported on a slight but consistent positive prediction edge.\n\nIn [3], an HMM-based classifier was proposed for \u201creliable trends,\u201d defined to be specialized 15-day\nreturn sequences that end with either five consecutive positive or five consecutive negative returns. A\nclassifier was constructed using two HMMs, one trained to identify upward (reliable) trends and the\nother to identify downward (reliable) trends. Non-reliable sequences are always rejected. Therefore, this\ntechnique falls within selective prediction, but the selection function has been manually predefined.\n\n6 Concluding Remarks\n\nThe structure and modularity of HMMs make them particularly convenient for incorporating selective\nprediction mechanisms. Indeed, the proposed state-based method can result in a smooth and\nmonotonically decreasing risk-coverage trade-off curve that allows for some control over the desired\nlevel of selectivity. We focused on selective prediction of trends in financial sequences. For these\ndifficult prediction tasks our models can provide non-trivial prediction improvements. 
We expect\nthat the relative advantage of these selective prediction techniques will be higher in easier tasks, or\neven in the same task by utilizing more elaborate HMM modeling, perhaps incorporating other sources\nof specialized information such as prices of other correlated indices.\n\nWe believe that a major bottleneck in attaining smaller test errors is the noisy risk estimates we obtain\nfor the hidden states (see Figure 3b). This noise is partly due to the noisy nature of our prediction\nproblem, but may also be attributed to the simplistic approach we took in estimating empirical risk.\nA challenging problem would be to incorporate more robust estimates in our mechanism, which\nis likely to enable better risk-coverage trade-offs. Finally, it will be very interesting to examine\nselective prediction mechanisms in the more general context of Bayesian networks and other types\nof graphical models.\n\nAcknowledgements\n\nThis work was supported in part by the IST Programme of the European Community, under the\nPASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors\u2019\nviews.\n\nReferences\n\n[1] P. L. Bartlett and M. H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823\u20131840, 2008.\n[2] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164\u2013171, 1970.\n[3] M. Bicego, E. Grosso, and E. Otranto. A Hidden Markov Model approach to classify and predict the sign of financial local trends. SSPR, 5342:852\u2013861, 2008.\n[4] M. Brand. Coupled Hidden Markov Models for modeling interacting processes. Technical Report 405, MIT Media Lab, 1997.\n[5] C. Chow. On optimum recognition error and reject tradeoff. IEEE-IT, 16:41\u201346, 1970.\n[6] R. El-Yaniv and Y. Wiener. 
On the foundations of noise-free selective classification. JMLR, 11:1605\u20131641, May 2010.\n[7] R. El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 2011.\n[8] S. Fine, Y. Singer, and N. Tishby. The Hierarchical Hidden Markov Model: Analysis and Applications. Machine Learning, 32(1):41\u201362, 1998.\n[9] Y. Freund, Y. Mansour, and R. E. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4):1698\u20131722, 2004.\n[10] Z. Ghahramani and M. I. Jordan. Factorial Hidden Markov Models. Machine Learning, 29(2\u20133):245\u2013273, 1997.\n[11] J. Hamilton. Analysis of time series subject to changes in regime. Journal of Econometrics, 45(1\u20132):39\u201370, 1990.\n[12] B. Hanczar and E. R. Dougherty. Classification with reject option in gene expression data. Bioinformatics, 24:1889\u20131895, 2008.\n[13] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning Hidden Markov Models. In COLT, 2009.\n[14] A. L. Koerich. Rejection strategies for handwritten word recognition. In IWFHR, 2004.\n[15] A. Krogh. Hidden Markov Models for labeled sequences. In Proceedings of the 12th IAPR ICPR\u201994, pages 140\u2013144, 1994.\n[16] L. R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.\n[17] S. Rao and J. Hong. Analysis of Hidden Markov Models and Support Vector Machines in financial applications. Technical Report UCB/EECS-2010-63, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2010.\n[18] L. K. Saul and M. I. Jordan. Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 37:75\u201387, 1999.\n[19] S. Shi and A. S. Weigend. Taking time seriously: Hidden Markov Experts applied to financial engineering. In IEEE/IAFE, pages 244\u2013252. 
IEEE, 1997.\n[20] S. Siddiqi, G. Gordon, and A. Moore. Fast State Discovery for HMM Model Selection and Learning. In AI-STATS, 2007.\n[21] F. Tortorella. Reducing the classification cost of support vector classifiers through an ROC-based reject rule. Pattern Anal. Appl., 7:128\u2013143, 2004.\n[22] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE-IT, 13(2):260\u2013269, 1967.\n[23] Y. Zhang. Prediction of financial time series with Hidden Markov Models. Master\u2019s thesis, The School of Computing Science, Simon Fraser University, Canada, 2004.\n", "award": [], "sourceid": 568, "authors": [{"given_name": "Dmitry", "family_name": "Pidan", "institution": null}, {"given_name": "Ran", "family_name": "El-Yaniv", "institution": null}]}