{"title": "Probabilistic Belief Revision with Structural Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 1036, "page_last": 1044, "abstract": "Experts (human or computer) are often required to assess the probability of uncertain events. When a collection of experts independently assess events that are structurally interrelated, the resulting assessment may violate fundamental laws of probability. Such an assessment is termed incoherent. In this work we investigate how the problem of incoherence may be affected by allowing experts to specify likelihood models and then update their assessments based on the realization of a globally-observable random sequence.", "full_text": "Probabilistic Belief Revision with Structural\n\nConstraints\n\nPeter B. Jones\n\nMIT Lincoln Laboratory\nLexington, MA 02420\njonep@ll.mit.edu\n\nVenkatesh Saligrama\n\nDept. of ECE\n\nBoston University\nBoston, MA 02215\n\nsrv@bu.edu\n\nAbstract\n\nSanjoy K. Mitter\nDept. of EECS\n\nMIT\n\nCambridge, MA 02139\nmitter@mit.edu\n\nExperts (human or computer) are often required to assess the probability of un-\ncertain events. When a collection of experts independently assess events that are\nstructurally interrelated, the resulting assessment may violate fundamental laws of\nprobability. Such an assessment is termed incoherent. In this work we investigate\nhow the problem of incoherence may be affected by allowing experts to specify\nlikelihood models and then update their assessments based on the realization of a\nglobally-observable random sequence.\nKeywords: Bayesian Methods, Information Theory, consistency\n\n1 Introduction\n\nCoherence is perhaps the most fundamental property of probability estimation. 
Coherence will be formally defined later, but in essence a coherent probability assessment is one that exhibits logical consistency. Incoherent assessments are those that cannot be correct, that are at odds with the underlying structure of the space, and so cannot be extended to a complete probability distribution [1, 2]. From a decision-theoretic standpoint, treating assessments as odds, incoherent assessments result in guaranteed losses to assessors. They are dominated strategies, meaning that for every incoherent assessment there is a coherent assessment that uniformly improves the outcome for the assessors. Despite this fact, expert assessments (human and machine) are vulnerable to incoherence [3].

Previous authors have used coherence as a tool for fusing distributed expert assessments [4, 5, 6]. The focus has been on static coherence, in which experts are polled once about some set of events and the responses are then fused through a geometric projection. Besides relying on arbitrary scoring functions to define the "right" projection, such analyses don't address dynamically evolving assessments or forecasts. This paper is, to our knowledge, the first attempt to analyze the problem of coherence under Bayesian belief dynamics. The importance of dynamic coherence is demonstrated in the following example.

Consider two uncertain events A1 and A2 where A1 ⊆ A2 (e.g. A2 = {NASDAQ ↑ tomorrow} and A1 = {NASDAQ ↑ tomorrow ≥ 10 points}). To be coherent, a probability assessment must obey the relation P(A1) ≤ P(A2). For the purposes of the example, suppose the initial belief is P(A1) = P(A2) = 0.5, which is coherent. Next, suppose there is some binary random variable Z that is believed to correlate with the underlying event (e.g. Z = 1{Google ↑ today}, where 1 is an indicator function). The believed dependence between Z and Ai is captured by a likelihood model P(Z|Ai) that gives the probability of observing Z when event Ai does or doesn't occur. For the example, suppose Z = 0 and the believed likelihoods are P(Z = 0|A1) = 1, P(Z = 0|Ā1) = 0.5, and P(Z = 0|A2) = P(Z = 0|Ā2) = 0.5, where Ā is the complement of A. There's nothing inherently irrational in this belief model, but when Bayes' Rule is applied, it gives P(A1|Z = 0) = 0.67 > P(A2|Z = 0) = 0.5. The belief update has introduced incoherence!

(Footnote: This work was sponsored by the U.S. Government under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.)

1.1 Motivating Example

Concerned with their network security, BigCorps wants to purchase an Intrusion Detection and Prevention System (IDPS). They have two options, IDPS1 and IDPS2. IDPS1 detects both distributed denial of service (DDoS) attacks and port scan (PS) attacks, while IDPS2 detects only DDoS attacks. While studying the NIST guide to IDPSs [7], BigCorps' CTO notes the recommendation that "organizations should consider using multiple types of IDPS technologies to achieve more comprehensive and accurate detection and prevention of malicious activity." Following the NIST recommendation, BigCorps purchases both IDPSs and sets them to work monitoring network traffic.

One morning while reading the output reports of the two detectors, an intrepid security analyst witnesses an interesting behavior. IDPS2 is registering an attack probability of 0.1 while detector IDPS1 is reading an attack probability of 0.05. Since the threats detected by IDPS1 are a superset of those detected by IDPS2, the probability assigned by IDPS1 should always be at least as large as that assigned by IDPS2.
The dilemma faced by our analyst is how to reconcile the logically incoherent outputs of the two detectors. In particular, how does one ascribe probabilities in a way that is logically consistent, but still retains as much as possible of the expert assessments of the detectors?

1.2 Contributions of this Work

This work introduces the concept of dynamic coherence, one that has not been previously treated in the literature. We suggest two possible forms of dynamic coherence and analyze the relationship between them. They are implemented and compared in a simple network modeling simulation.

1.3 Previous Work

Previous authors have analyzed coherence with respect to contingent (or conditional) probability assessments [8, 9, 10]. These developments attempt to determine conditions characterizing coherent subjective posteriors. While likelihood models are a form of contingent probability assessment, this paper goes further in analyzing the impact of these assessments on coherent belief dynamics.

In [11, 12] a different form of conditional coherence is suggested, which derives from coherence of a joint probability distribution over observations and states of nature. It is shown that for this stronger form of conditional coherence, certain specially structured event sets and likelihood functions will produce coherent posterior assessments.

Logical consistency under non-Bayesian belief dynamics has been previously analyzed. In [13], conditions for invariance under permutations of the observational sequence under Jeffrey's rule are developed. A comparison of Jeffrey's rule and Pearl's virtual evidence method is made in [14], which shows that the virtual evidence method implicitly assumes the conditions of Jeffrey's update rule.

2 Model

Let Ω = {ω1, ω2, . . .} be an event space and (Ω, F) a measurable space. Let θ : Ω → Θ be a measurable random variable; consider Θ = {θ1, θ2, . . . , θJ} to be the set of all possible "states of the world." Also, let Zi : Ω → Z be a sequence of measurable random variables; consider Zi to be the sequence of observations, with Z = {z1, z2, . . . , zK} and K < ∞. Let Ωθ (resp. ΩZi) be the pre-image of θ (resp. Zi). Since the random variables are assumed measurable, Ωθ and ΩZi are measurable sets (i.e. elements of F), as are their countable intersections and unions.

For i = 1, 2, . . . , N, let A_i^θ be a subset of Θ, let Ai = ∪_{θ ∈ A_i^θ} Ωθ, and let A = {Ai}. We call elements of A events under assessment. The characteristic matrix χ for the events under assessment is defined as

χij = 1 if θj ∈ A_i^θ, and χij = 0 otherwise.

An individual probability assessment P : A → [0, 1] maps each event under assessment to the unit interval. In an abuse of notation, we will let P ≜ [P(A1) P(A2) · · · P(AN)]^T be a (joint) probability assessment. A coherent assessment (i.e. one that is logically consistent) can be described geometrically as lying in the convex hull of the columns of χ, meaning ∃λ ∈ [0, 1]^J s.t. Σ_i λi = 1 and P = χλ.

We now consider a sequence of probability assessments Pn defined as follows: Pn is the result of a belief revision process based on an initial probability assessment P0, a likelihood model pn(z|A), and a sequence of observations Z1, Z2, . . . , Zn. A likelihood model pn(z|A) is a pair of probability mass functions over the observations: one conditioned on A and the other conditioned on Ā (where Ā denotes the complement of A). We will make the simplifying assumption that the likelihood model is static, i.e.
pn(z|A) = p(z|A) and pn(z|Ā) = p(z|Ā) for all n.

In this paper we assume belief revision dynamics governed by Bayes' rule, i.e.

Pn+1 = p(zn+1|A) Pn / [ p(zn+1|A) Pn + p(zn+1|Ā) (1 − Pn) ] = 1 / [ 1 + (p(zn+1|Ā)/p(zn+1|A)) · (1 − Pn)/Pn ]

To simplify development, denote p(z = zi|Aj) = αij and p(z = zi|Āj) = βij, and assume ∀j, ∃i s.t. αij ≠ βij (i.e. each event has at least one informative observation) and αij ∈ (0, 1), βij ∈ (0, 1) for all i, j (i.e. no observation determines absolutely whether any event obtains). Then by induction the posterior probability of event Aj after n observations is:

Pn(Aj) = 1 / ( 1 + ((1 − P0)/P0) · Π_{i=1}^K (βij/αij)^{ni} )    (1)

where ni is the number of observations zi.

3 Probability convergence for single assessors

For a single assessor revising his estimate of the likelihood of event A, let the probability model be given by p(z = zi|A) = αi and p(z = zi|Ā) = βi. It is convenient to rewrite (1) in terms of the ratios ρi = ni/n, and for simplicity we assume P0 = 0.5 (although the analysis holds for general P0 ∈ (0, 1)). Substituting yields

Pn = 1 / ( 1 + [ Π_{i=1}^K (βi/αi)^{ρi} ]^n )    (2)

Note that 1) ρ is the empirical distribution over the observations, and so converges almost surely (a.s.) to the true generating distribution, and 2) the convergence properties of Pn are determined by the quantity between the square brackets in (2). Specifically, let

L∞ = lim_{n→∞} Π_{i=1}^K (βi/αi)^{ρi}

L∞ is commonly referred to as the likelihood ratio, familiar from classical binary hypothesis testing. Since ρ converges a.s.
and the function is continuous, L∞ exists a.s. If L∞ < 1 then Pn → 1; if L∞ > 1 then Pn → 0; if L∞ = 1 then Pn → 1/2.

3.1 Matched likelihood functions

Assume that the likelihood model is both infinitely precise and infinitely accurate, meaning that when A (resp. Ā) obtains, observations are generated i.i.d. according to α (resp. β).

Assume that A obtains; then L∞ = Π_{i=1}^K (βi/αi)^{αi} a.s. Let ℓ∞ = log L∞, which in this case yields

ℓ∞ = log Π_{i=1}^K (βi/αi)^{αi} = Σ_{i=1}^K αi log(βi/αi) = −D(α||β) < 0

where all relations hold a.s., D(·||·) is the relative entropy [15], and the last inequality follows since by assumption α ≠ β. Since ℓ∞ < 0 ⇔ L∞ < 1, this implies that when the true generating distribution is α, Pn → 1 a.s.

Similarly, when Ā obtains, we have

ℓ∞ = log Π_{i=1}^K (βi/αi)^{βi} = Σ_{i=1}^K βi log(βi/αi) = D(β||α) > 0

and Pn → 0 a.s.

3.2 Mismatched likelihood functions

Now consider the situation when the expert-assessed likelihood model is incorrect. Assume the observation generating distribution is γ = P(Zi = z), where γ ≠ α and γ ≠ β. In this case, ℓ∞ = Σ_i γi log(βi/αi). We define

T(γ) = −ℓ∞ = Σ_i γi log(αi/βi)    (3)

Then the probability simplex over the observation space Z can be partitioned into two sets: P0 = {γ | T(γ) < 0} and P1 = {γ | T(γ) > 0}. By the a.s. convergence of the empirical distribution, γ ∈ Pi ⇒ Pn → i.
(The boundary set {γ | T(γ) = 0} represents an unstable equilibrium in which Pn a.s. converges to 1/2.)

The problem of mismatched likelihood functions is similar to composite hypothesis testing (cf. [16] and references therein). Composite hypothesis testing attempts to design tests to determine the truth or falsity of a hypothesis with some ambiguity in the underlying parameter space. Because of this ambiguity, each hypothesis Hi corresponds not to a single distribution, but to a set of possible distributions. In the mismatched likelihood function problem, composite spaces are formed due to the properties of Bayes' rule for a specific likelihood model. A corollary of the above result is that if Hi ⊆ Pi then Bayes' rule (under the specific likelihood model) is an asymptotically perfect detector.

4 Multiple Assessors with Structural Constraints

In Section 3 we analyzed convergence properties of a single event under assessment. Considering multiple events introduces the challenge of defining a dynamic concept of coherence for the assessment revision process. In this section we suggest two possible definitions of dynamic coherence and consider some of the implications of these definitions.

4.1 Step-wise Coherence

We first introduce a step-wise definition of coherence, and derive equivalency conditions for the special class of 2-expert likelihood models.

Definition 1 Under the Bayes' rule revision process, a likelihood model p(z|A) is step-wise coherent (SWC) if Pn ∈ convhull(χ) ⇒ Pn+1 ∈ convhull(χ) for all z ∈ Z.

Essentially this definition says that if the posterior assessment process is coherent at any time, it will remain coherent perpetually, independent of the observation sequence.
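Before turning to multi-assessor conditions, the single-assessor convergence result of Section 3.2 can be checked numerically: the sign of T(γ) in (3) predicts the limit of the Bayes iteration even when the likelihood model is mismatched. The sketch below (hypothetical α, β, γ values, not from the paper) runs the update of Section 2 on samples drawn from a mismatched γ:

```python
import math
import random

def T(alpha, beta, gamma):
    """T(gamma) = sum_i gamma_i * log(alpha_i / beta_i), eq. (3)."""
    return sum(g * math.log(a / b) for g, a, b in zip(gamma, alpha, beta))

def simulate(alpha, beta, gamma, n=20000, p0=0.5, seed=0):
    """Run n Bayes' rule updates on observations drawn i.i.d. from gamma."""
    rng = random.Random(seed)
    p = p0
    for _ in range(n):
        z = rng.choices(range(len(gamma)), weights=gamma)[0]
        p = alpha[z] * p / (alpha[z] * p + beta[z] * (1 - p))
    return p

alpha = [0.7, 0.3]   # hypothetical model p(z|A)
beta  = [0.4, 0.6]   # hypothetical model p(z|not A)
gamma = [0.6, 0.4]   # true generating distribution (matches neither)

t = T(alpha, beta, gamma)            # here t > 0, so gamma lies in P1
print(t > 0, simulate(alpha, beta, gamma) > 0.99)
```

With these numbers T(γ) ≈ 0.06 > 0, so γ ∈ P1 and the simulated posterior is driven to 1, as the partition result predicts; flipping γ to favor the second symbol moves it into P0 and drives the posterior to 0.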
We derive necessary and sufficient conditions for SWC for the characteristic matrix given by

χ = [ 1 1 0 ; 0 1 0 ]    (4)

Generalizations of this development are possible for any χ ∈ {0, 1}^{2×|Θ|}.

Note that under the characteristic matrix given by (4), a model is SWC iff Pn(A1) ≥ Pn(A2) for all n and all coherent P0. Proceeding inductively, assume Pn is marginally coherent, i.e. Pn(A1) = Pn(A2) = π. Due to the continuity of the update rule, a model will be SWC iff it is coherent at the margins. For coherence, for any i we must have Pn+1(A1) ≥ Pn+1(A2). By substitution into (1), αi1 π / (αi1 π + βi1 (1 − π)) ≥ αi2 π / (αi2 π + βi2 (1 − π)), or, equivalently, αi1 / (αi1 π + βi1 (1 − π)) ≥ αi2 / (αi2 π + βi2 (1 − π)). By monotonicity, (αi1 π + βi1 (1 − π)) / (αi2 π + βi2 (1 − π)) ∈ [ min{αi1/αi2, βi1/βi2}, max{αi1/αi2, βi1/βi2} ]. Since at the degenerate endpoints π = 1 and π = 0 this ratio equals αi1/αi2 and βi1/βi2 respectively, for χ given by (4) the model will be SWC iff αi1/αi2 ≥ βi1/βi2 ∀i, or (rearranging)

αi1/βi1 ≥ αi2/βi2  ∀i    (5)

4.2 Asymptotic coherence

While it is relatively simple to characterize coherent models in the two-assessor case, in general SWC is difficult to check. As such, we introduce a simpler condition:

Definition 2 A likelihood model p(z|A) is weakly asymptotically coherent (WAC) if for all observation generating distributions γ s.t. limn→∞ Pn ∈ {0, 1}^N, ∃i s.t.
limn→∞ Pn = χei a.s., where ei is the ith unit vector.

Lemma 1 Step-wise coherence implies weak asymptotic coherence.

Assume that a model is SWC but not WAC. Since it's not WAC, there exists a γ s.t. Zi drawn IID from γ a.s. results in Pn → P̂, where P̂ ∈ {0, 1}^N is not a column of χ and is therefore not coherent. Since this holds regardless of initial conditions, assume the process is initialized coherently. Then, by a separating hyperplane argument, there must exist some n (and therefore some zn) s.t. Pn ∈ convhull(χ) and Pn+1 ∉ convhull(χ). This contradicts the assumption that the likelihood model is SWC. Therefore any SWC model is also WAC. We demonstrate that the converse is not true by counterexample in Section 4.2.2.

4.2.1 WAC for static models

Analogous to (3), we define

Tj(γ) = Σ_i γi log(αij/βij).    (6)

For a given γ, define the logical vector r(γ) as

rj(γ) = 0 if Tj(γ) < 0; rj(γ) = 1 if Tj(γ) > 0; rj(γ) undetermined if Tj(γ) = 0    (7)

Lemma 2 A likelihood model is WAC if ∀γ s.t. limn→∞ Pn ∈ {0, 1}^N, ∃i s.t. r(γ) = χei.

Define the sets Pi = {γ | r(γ) = χei}. Lemma 2 states that for a WAC likelihood model, {Pi} partitions the simplex (excluding unstable edge events) into sets of distributions s.t. γ ∈ Pi ⇒ Pn → χei. It is simple to show that the sets Pi are convex, and by definition the boundaries between sets are linear.

4.2.2 Motivating Example Revisited

Consider again the motivating example of the two IDPSs from Section 1.1. Recall that IDPS1 detects a superset of the attacks detected by IDPS2, and so this scenario conforms to the characteristic matrix analyzed in Section 4.1.
Therefore (5) gives necessary and sufficient conditions for SWC, while (7) gives necessary and sufficient conditions for WAC.

Suppose that both the IDPSs use the interval between packet arrivals as their observation, and assume the learned likelihood models for the two IDPSs happen to be geometrically distributed with parameters x1, x2 (when an attack is occurring) and y1, y2 (when no attack is occurring), with the index denoting the IDPS. We will analyze SWC and WAC for this class of models.

Plugging the given likelihood model into (5) implies that the model is SWC iff, for z = 0, 1, 2, . . .

((1 − x1)/(1 − y1))^z (x1/y1) ≥ ((1 − x2)/(1 − y2))^z (x2/y2)    (8)

Equation (8) will be satisfied iff x1/y1 ≥ x2/y2 and (1 − x1)/(1 − y1) ≥ (1 − x2)/(1 − y2), which is therefore a necessary and sufficient condition for SWC.

Now, we turn to WAC. Forming T as defined in (6), we see that

Tj(γ) = Σ_z γz [ z log((1 − xj)/(1 − yj)) + log(xj/yj) ] = μ log((1 − xj)/(1 − yj)) + log(xj/yj)    (9)

where μ = Eγ[z]. By the structure of the characteristic matrix, the model will be WAC iff T2(γ) > 0 ⇒ T1(γ) > 0 for all μ ≥ 0. Assume for convenience that xi > yi. Then {γ | Ti(γ) > 0} = {γ | μ < log(xi/yi) / log((1 − yi)/(1 − xi))}, and therefore the model is WAC iff

log(x1/y1) / log((1 − y1)/(1 − x1)) ≥ log(x2/y2) / log((1 − y2)/(1 − x2))    (10)

Comparing the conditions for SWC (8) to those for WAC (10), we see that any parameters satisfying (8) also satisfy (10), but not vice versa. For example, x1 = 0.3, x2 = 0.5, y1 = 0.2, y2 = 0.25 don't satisfy (8), but do satisfy (10).
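The counterexample is easy to verify numerically. The sketch below checks (8) via its two ratio conditions and (10) read as a comparison of the thresholds μj = log(xj/yj) / log((1 − yj)/(1 − xj)) at which Tj changes sign (a reading reconstructed from the derivation above, not code from the paper):

```python
import math

def swc(x1, y1, x2, y2):
    """SWC for geometric likelihood models, eq. (8): both ratio conditions hold."""
    return x1 / y1 >= x2 / y2 and (1 - x1) / (1 - y1) >= (1 - x2) / (1 - y2)

def wac(x1, y1, x2, y2):
    """WAC, eq. (10): compare the mean thresholds at which T_j changes sign."""
    mu = lambda x, y: math.log(x / y) / math.log((1 - y) / (1 - x))
    return mu(x1, y1) >= mu(x2, y2)

# The counterexample parameters from the text: fails SWC, satisfies WAC.
print(swc(0.3, 0.2, 0.5, 0.25), wac(0.3, 0.2, 0.5, 0.25))  # False True
```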
Thus WAC is truly a weaker sense of coherence than SWC.

5 Coherence with only finitely many observations

As shown in Sections 3 and 4, a WAC likelihood model generates a partition {Pi} over the observation probability simplex such that γ ∈ Pi ⇒ Pn → χei. The question we now address is, given a WAC likelihood model and finitely many observations (with empirical distribution γ̂n), how to revise an incoherent posterior probability assessment Pn so that it is both coherent and consistent with the observed data.

Principle of Conserving Predictive Uncertainty: Given γ̂n, choose λ such that λi = Pr[limn→∞ γ̂n ∈ Pi] for each i (where γ ∈ Pi iff Pn → χei).

The principle of conserving predictive uncertainty states that in revising an incoherent assessment Pn to a coherent one P̃n, the weight vector over the columns of χ should reflect the uncertainty in whether the observations are being generated by a distribution in the corresponding element of the partition {Pi} (and therefore whether Pn is converging to χei).

Given a uniform prior over generating distributions γ and assuming Lebesgue measure μ over the parameters of the generating distribution, we can write

P(γ ∈ Pi | γ̂n) = ∫_{γ∈Pi} P(γ | γ̂n) dμ = ∫_{γ∈Pi} [ P(γ̂n | γ) P(γ) / ∫_P P(γ̂n | γ′) P(γ′) dμ′ ] dμ = (1 / ∫_P P(γ̂n | γ′) dμ′) ∫_{γ∈Pi} P(γ̂n | γ) dμ

In the limit of large n, P(γ̂n | γ) ≐ e^{−nD(γ̂n||γ)} (where ≐ denotes equality to the
first degree in the exponent; cf. [15]). This implies that as n gets large, Pr[limn→∞ γ̂n ∈ Pi] is dominated by the point γ*_i = argmin_{γ∈Pi} D(γ̂n||γ) (i.e. the reverse i-projection, or Maximum Likelihood estimate). This suggests the following approximation method for determining a coherent projection of Pn:

λj = P(γ̂ | γ*_j) / Σ_{j′ ≤ |{Pi}|} P(γ̂ | γ*_{j′})    (11)

The relationship between the ML estimates (γ*_i) and the probability over the columns of the characteristic matrix is represented graphically in Figure 1. As will be shown in Section 6, the principle of conserving predictive uncertainty can even be effectively applied to non-WAC models.

[Figure 1: The relationship between observation and outcome simplices. The left panel depicts the observation simplex partitioned into cells P1, . . . , P4, with the empirical distribution γ̂n and the reverse i-projections γ*_1, . . . , γ*_4; the right panel depicts the outcome simplex with vertices χe1, . . . , χe4 and the resulting weight vector λ.]

5.1 Sparse coherent approximation

In general |Θ| (the length of the vector λ) can be of order 2^N (where N is the number of assessors), so solving for λ directly using (11) may be computationally infeasible.
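The approximation (11) can be sketched concretely in a case small enough to solve by hand: with a binary observation alphabet and two hypothetical assessors, each Tj(γ) is affine in γ1 = γ(z = 1), so the partition {Pi} reduces to intervals of [0, 1] and the reverse i-projection of γ̂ onto a cell reduces to clipping (since D(γ̂||γ) is convex in γ with minimum at γ = γ̂). All model parameters below are illustrative assumptions, not values from the paper:

```python
import math

def D(p, q):
    """KL divergence between binary distributions with P(z=1) = p, q."""
    d = 0.0
    for pi, qi in ((p, q), (1 - p, 1 - q)):
        if pi > 0:
            d += pi * math.log(pi / qi)
    return d

# Hypothetical 2-assessor likelihood models over z in {0, 1}:
alpha = [(0.2, 0.8), (0.4, 0.6)]   # alpha_j = p(z | A_j)
beta  = [(0.7, 0.3), (0.6, 0.4)]   # beta_j  = p(z | not A_j)

def root(j):
    """The g1 at which T_j(gamma) = (1-g1)log(a0/b0) + g1*log(a1/b1) = 0."""
    c0 = math.log(alpha[j][0] / beta[j][0])
    c1 = math.log(alpha[j][1] / beta[j][1])
    return -c0 / (c1 - c0)

# The sign changes of T_1, T_2 cut [0, 1] into the partition cells P_i.
r1, r2 = sorted(root(j) for j in range(2))
cells = [(0.0, r1), (r1, r2), (r2, 1.0)]

g_hat, n = 0.55, 40                                   # empirical dist., sample size
proj = [min(max(g_hat, lo), hi) for lo, hi in cells]  # reverse i-projections
w = [math.exp(-n * D(g_hat, g)) for g in proj]        # P(g_hat | g*_j), to 1st order
lam = [wi / sum(w) for wi in w]                       # eq. (11) weights
print([round(l, 3) for l in lam])
```

Consistent with the sparse-approximation discussion that follows, the cell containing γ̂ receives the largest weight, and weights decay exponentially in n times the divergence to each cell's projection.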
The following result sug-\ngests that to generate the optimal (in the sense of capturing to most possible weight) O(N) sparse\napproximation of \u03bb we need only calculate the O(N 2) reverse i-projections.\nLet \u03bb be determined according to (11) and let {Pi} be as de\ufb01ned in Section 4. Assume wlog that\n\u03bbi \u2265 \u03bbj for all i > j. De\ufb01ne the neighborhood of Pi as N (Pi) = {Pj : |r(Pi)\u2212r(Pj)| = 1} where\nr(Pi) is de\ufb01ned as in (7). The neighborhood of Pi is the set of partition elements such that the limit\nof one (and only one) assessor\u2019s probability assessment has changed. The size of the neighborhood\nis thus less than or equal to N.\nBy the assumed ordering of \u03bb and (11), it is immediately evident that \u02c6\u03b3 = \u03b3\u2217\n1, i.e. the maximally\nweighted partition element is the one that contains the empirical distribution. It can be shown that\nj