{"title": "Consistency of weighted majority votes", "book": "Advances in Neural Information Processing Systems", "page_first": 3446, "page_last": 3454, "abstract": "We revisit from a statistical learning perspective the classical decision-theoretic problem of weighted expert voting. In particular, we examine the consistency (both asymptotic and finitary) of the optimal Nitzan-Paroush weighted majority and related rules. In the case of known expert competence levels, we give sharp error estimates for the optimal rule. When the competence levels are unknown, they must be empirically estimated. We provide frequentist and Bayesian analyses for this situation. Some of our proof techniques are non-standard and may be of independent interest. The bounds we derive are nearly optimal, and several challenging open problems are posed. Experimental results are provided to illustrate the theory.", "full_text": "Consistency of weighted majority votes\n\nDaniel Berend Computer Science Department and Mathematics Department\n\nBen Gurion University\n\nBeer Sheva, Israel berend@cs.bgu.ac.il\n\nAryeh Kontorovich\n\nComputer Science Department Ben Gurion University\n\nBeer Sheva, Israel karyeh@cs.bgu.ac.il\n\nAbstract\n\nWe revisit from a statistical learning perspective the classical decision-theoretic\nproblem of weighted expert voting.\nIn particular, we examine the consistency\n(both asymptotic and \ufb01nitary) of the optimal Nitzan-Paroush weighted majority\nand related rules. In the case of known expert competence levels, we give sharp\nerror estimates for the optimal rule. When the competence levels are unknown,\nthey must be empirically estimated. We provide frequentist and Bayesian analyses\nfor this situation. Some of our proof techniques are non-standard and may be\nof independent interest. 
The bounds we derive are nearly optimal, and several challenging open problems are posed.\n\n1 Introduction\n\nImagine independently consulting a small set of medical experts for the purpose of reaching a binary decision (e.g., whether to perform some operation). Each doctor has some “reputation”, which can be modeled as his probability of giving the right advice. The problem of weighting the input of several experts arises in many situations and is of considerable theoretical and practical importance. The rigorous study of majority vote has its roots in the work of Condorcet [1]. By the 1970s, the field of decision theory was actively exploring various voting rules (see [2] and the references therein). A typical setting is as follows. An agent is tasked with predicting some random variable Y ∈ {±1} based on input X_i ∈ {±1} from each of n experts. Each expert X_i has a competence level p_i ∈ (0, 1), which is the probability of making a correct prediction: P(X_i = Y) = p_i. Two simplifying assumptions are commonly made:\n\n(i) Independence: The random variables {X_i : i ∈ [n]} are mutually independent conditioned on the truth Y.\n\n(ii) Unbiased truth: P(Y = +1) = P(Y = −1) = 1/2.\n\nWe will discuss these assumptions below in greater detail; for now, let us just take them as given. (Since the bias of Y can be easily estimated from data, only the independence assumption is truly restrictive.) A decision rule is a mapping f : {±1}^n → {±1} from the n expert inputs to the agent's final decision. Our quantity of interest throughout the paper will be the agent's probability of error,\n\nP(f(X) ≠ Y). (1)\n\nA decision rule f is optimal if it minimizes the quantity in (1) over all possible decision rules.
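The setup above is easy to simulate. The following sketch draws an unbiased truth Y and conditionally independent expert votes, and estimates the error probability (1) for a given decision rule by Monte Carlo. (The function names and the competence values are ours, chosen for illustration; they do not appear in the paper.)

```python
import random

def sample_committee(p, rng):
    # Assumption (ii): unbiased truth Y in {-1, +1}.
    y = rng.choice((-1, 1))
    # Assumption (i): conditioned on Y, expert i independently
    # agrees with Y with probability p[i].
    x = [y if rng.random() < pi else -y for pi in p]
    return x, y

def error_probability(f, p, trials=100_000, seed=0):
    # Monte Carlo estimate of P(f(X) != Y), the quantity in (1).
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        x, y = sample_committee(p, rng)
        if f(x) != y:
            errors += 1
    return errors / trials

# Simple (unweighted) majority vote over five experts of varying competence.
majority = lambda x: 1 if sum(x) > 0 else -1
p = [0.9, 0.8, 0.7, 0.6, 0.55]   # hypothetical competence levels
print(error_probability(majority, p))
```

Any decision rule `f : {±1}^n → {±1}` can be plugged into `error_probability`, which is convenient for comparing the simple majority against the weighted rules discussed next.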
It was shown in [2] that, when Assumptions (i)-(ii) hold and the true competences p_i are known, the optimal decision rule is obtained by an appropriately weighted majority vote:\n\nf^OPT(x) = sign( Σ_{i=1}^n w_i x_i ), (2)\n\nwhere the weights w_i are given by\n\nw_i = log( p_i / (1 − p_i) ), i ∈ [n]. (3)\n\nThus, w_i is the log-odds of expert i being correct; the voting rule in (2), also known as naive Bayes [3], may be seen as a simple consequence of the Neyman-Pearson lemma [4].\n\nMain results. The formula in (2) raises immediate questions, which apparently have not previously been addressed. The first one has to do with the consistency of the Nitzan-Paroush optimal rule: under what conditions does the probability of error decay to zero, and at what rate? In Section 3, we show that the probability of error is controlled by the committee potential Φ, defined by\n\nΦ = Σ_{i=1}^n (p_i − 1/2) w_i = Σ_{i=1}^n (p_i − 1/2) log( p_i / (1 − p_i) ). (4)\n\nMore precisely, we prove in Theorem 1 that log P(f^OPT(X) ≠ Y) ≍ −Φ, where ≍ denotes equivalence up to universal multiplicative constants.\n\nAnother issue not addressed by the Nitzan-Paroush result is how to handle the case where the competences p_i are not known exactly but rather estimated empirically by p̂_i. We present two solutions to this problem: a frequentist one and a Bayesian one. As we show in Section 4, the frequentist approach does not admit an optimal empirical decision rule. Instead, we analyze empirical decision rules in various settings: high-confidence (i.e., |p̂_i − p_i| ≪ 1) vs. low-confidence, and adaptive vs. nonadaptive. The low-confidence regime requires no additional assumptions, but gives weaker guarantees (Theorem 5).
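The optimal weights (3), the committee potential (4), and the resulting upper bound exp(−Φ/2) of Theorem 1(i) are all one-liners to compute. A minimal sketch (helper names are ours; the competences are invented for illustration):

```python
from math import log, exp

def nitzan_paroush_weights(p):
    # w_i = log(p_i / (1 - p_i)): the log-odds of expert i being correct, eq. (3).
    return [log(pi / (1 - pi)) for pi in p]

def committee_potential(p):
    # Phi = sum_i (p_i - 1/2) * w_i, eq. (4).
    return sum((pi - 0.5) * wi
               for pi, wi in zip(p, nitzan_paroush_weights(p)))

def np_rule(p):
    # The weighted majority rule (2); ties count as errors,
    # so mapping a zero sum to -1 here is an arbitrary choice.
    w = nitzan_paroush_weights(p)
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

p = [0.9, 0.8, 0.7, 0.6, 0.55]   # hypothetical competences
phi = committee_potential(p)
print(f"Phi = {phi:.3f}, Theorem 1(i) bound = {exp(-phi / 2):.4f}")
```

Note how the log-odds weighting lets a single strong expert overrule several weak ones: with competences (0.9, 0.6, 0.6), the rule sides with the first expert even when the other two disagree with him.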
In the high-confidence regime, the adaptive approach produces error estimates in terms of the empirical p̂_i's (Theorem 7), while the nonadaptive approach yields a bound in terms of the unknown p_i's, which still leads to useful asymptotics (Theorem 6). The Bayesian solution sidesteps the various cases above, as it admits a simple, provably optimal empirical decision rule (Section 5). Unfortunately, we are unable to compute (or even nontrivially estimate) the probability of error induced by this rule; this is posed as a challenging open problem.\n\n2 Related work\n\nMachine learning theory typically clusters weighted majority [5, 6] within the framework of online algorithms; see [7] for a modern treatment. Since the online setting is considerably more adversarial than ours, we obtain very different weighted majority rules and consistency guarantees. The weights w_i in (2) bear a striking similarity to the AdaBoost update rule [8, 9]. However, the latter assumes weak learners with access to labeled examples, while in our setting the experts are “static”. Still, we do not rule out a possible deeper connection between the Nitzan-Paroush decision rule and boosting.\n\nIn what began as the influential Dawid-Skene model [10] and is now known as crowdsourcing, one attempts to extract accurate predictions by pooling a large number of experts, typically without the benefit of being able to test any given expert's competence level. Still, under mild assumptions it is possible to efficiently recover the expert competences to a high accuracy and to aggregate them effectively [11]. Error bounds for the oracle MAP rule were obtained in this model by [12], and minimax rates were given in [13].\n\nA recent line of work [14, 15, 16] has developed a PAC-Bayesian theory for the majority vote of simple classifiers.
This approach facilitates data-dependent bounds and is even flexible enough to capture some simple dependencies among the classifiers, though, again, the latter are learners as opposed to our experts. Even more recently, experts with adversarial noise have been considered [17], and efficient algorithms for computing optimal expert weights (without error analysis) were given [18]. More directly related to the present work are the papers of [19], which characterizes the consistency of the simple majority rule, and [20, 21, 22], which analyze various models of dependence among the experts.\n\n3 Known competences\n\nIn this section we assume that the expert competences p_i are known and analyze the consistency of the Nitzan-Paroush optimal decision rule (2). Our main result here is that the probability of error P(f^OPT(X) ≠ Y) is small if and only if the committee potential Φ is large.\n\nTheorem 1. Suppose that the experts X = (X_1, . . . , X_n) satisfy Assumptions (i)-(ii) and f^OPT : {±1}^n → {±1} is the Nitzan-Paroush optimal decision rule. Then\n\n(i) P(f^OPT(X) ≠ Y) ≤ exp( −(1/2) Φ ).\n\n(ii) P(f^OPT(X) ≠ Y) ≥ 3 / ( 8 [ 1 + exp( 2Φ + 4√Φ ) ] ).\n\nAs we show in the full paper [27], the upper and lower bounds are both asymptotically tight. The remainder of this section is devoted to proving Theorem 1.\n\n3.1 Proof of Theorem 1(i)\n\nDefine the {0, 1}-indicator variables\n\nξ_i = 1_{X_i = Y}, (5)\n\ncorresponding to the event that the i-th expert is correct.
A mistake f^OPT(X) ≠ Y occurs precisely when^1 the sum of the correct experts' weights fails to exceed half the total mass:\n\nP(f^OPT(X) ≠ Y) = P( Σ_i w_i ξ_i ≤ (1/2) Σ_i w_i ). (6)\n\nSince ξ_i = 1_{X_i = Y} has mean E ξ_i = p_i, we may rewrite the probability in (6) as\n\nP( Σ_i w_i ξ_i − E[ Σ_i w_i ξ_i ] ≤ −Σ_i (p_i − 1/2) w_i ). (7)\n\nA standard tool for estimating such sum deviation probabilities is Hoeffding's inequality. Applied to (7), it yields the bound\n\nP(f^OPT(X) ≠ Y) ≤ exp( −2 [ Σ_i (p_i − 1/2) w_i ]^2 / Σ_i w_i^2 ), (8)\n\nwhich is far too crude for our purposes. Indeed, consider a finite committee of highly competent experts with p_i's arbitrarily close to 1, and X_1 the most competent of all. Raising X_1's competence sufficiently far above his peers will cause both the numerator and the denominator in the exponent to be dominated by w_1^2, making the right-hand side of (8) bounded away from zero. The inability of Hoeffding's inequality to guarantee consistency even in such a felicitous setting is an instance of its generally poor applicability to highly heterogeneous sums, a phenomenon explored in some depth in [23]. Bernstein's and Bennett's inequalities suffer from a similar weakness (see ibid.). Fortunately, an inequality of Kearns and Saul [24] is sufficiently sharp to yield the desired estimate: For all p ∈ [0, 1] and all t ∈ R,\n\n(1 − p) e^{−tp} + p e^{t(1−p)} ≤ exp( ((1 − 2p) / (4 log((1 − p)/p))) t^2 ). (9)\n\nRemark.
The Kearns-Saul inequality (9) may be seen as a distribution-dependent refinement of Hoeffding's (which bounds the left-hand side of (9) by e^{t^2/8}), and is not nearly as straightforward to prove. An elementary rigorous proof is given in [25]. Following up, [26] gave a “soft” proof based on transportation and information-theoretic techniques.\n\n^1 Without loss of generality, ties are considered to be errors.\n\nTurning to the quantities needed for part (ii), define the {±1}-indicator variables\n\nη_i = 2 · 1_{X_i = Y} − 1, (12)\n\ncorresponding to the event that the i-th expert is correct, and put q_i = 1 − p_i. The shorthand w · η = Σ_{i=1}^n w_i η_i will be convenient. We will need some simple lemmata, whose proofs are deferred to the journal version [27].\n\nLemma 2. P(f^OPT(X) = Y) = (1/2) Σ_{η ∈ {±1}^n} max{ P(η), P(−η) } and P(f^OPT(X) ≠ Y) = (1/2) Σ_{η ∈ {±1}^n} min{ P(η), P(−η) }, where P(η) = Π_{i: η_i = 1} p_i · Π_{i: η_i = −1} q_i.\n\nLemma 3. Suppose that s, s′ ∈ (0, ∞)^m satisfy Σ_{i=1}^m (s_i + s′_i) ≥ a and R^{−1} ≤ s_i / s′_i ≤ R, i ∈ [m], for some R < ∞. Then Σ_{i=1}^m min{ s_i, s′_i } ≥ a / (1 + R).\n\nLemma 4. Define the function F : (0, 1) → R by\n\nReturning to the proof of part (i): put θ_i = ξ_i − p_i, substitute into (6), and apply Markov's inequality:\n\nP(f^OPT(X) ≠ Y) = P( −Σ_i w_i θ_i ≥ Φ ) ≤ e^{−tΦ} E exp( −t Σ_i w_i θ_i ). (10)\n\nNow\n\nE e^{−t w_i θ_i} = p_i e^{−(1−p_i) w_i t} + (1 − p_i) e^{p_i w_i t} ≤ exp( ((−1 + 2p_i) / (4 log(p_i / (1 − p_i)))) w_i^2 t^2 ),\n\nwhere the inequality follows from (9).
By independence,\n\nE exp( −t Σ_i w_i θ_i ) = Π_i E e^{−t w_i θ_i} ≤ Π_i exp( (1/2)(p_i − 1/2) w_i t^2 ) = exp( (1/2) Φ t^2 ), (11)\n\nsince w_i = log(p_i / (1 − p_i)) reduces the exponent of the preceding bound to (1/2)(p_i − 1/2) w_i t^2, and hence P(f^OPT(X) ≠ Y) ≤ exp( (1/2) Φ t^2 − Φ t ). Choosing t = 1 yields the bound in Theorem 1(i).\n\n3.2 Proof of Theorem 1(ii)\n\nDefine the {±1}-indicator variables\n\nThen sup0\n\n0, q_i = 1 − p_i and B(x, y) = Γ(x)Γ(y)/Γ(x + y). Our full probabilistic model is as follows. Each of the n expert competences p_i is drawn independently from a Beta distribution with known parameters α_i, β_i. Then the i-th expert, i ∈ [n], is queried independently m_i times, with k_i correct predictions and m_i − k_i incorrect ones. As before, K = (k_1, . . . , k_n) is the (random) committee profile. Absent direct knowledge of the p_i's, the agent relies on an empirical decision rule f̂ : (x, k) ↦ {±1} to produce a final decision based on the expert inputs x together with the committee profile k. A decision rule f̂^Ba is Bayes-optimal if it minimizes P(f̂(X, K) ≠ Y), which is formally identical to (18), but semantically there is a difference: the former probability is over the p_i in addition to (X, Y, K). Unlike the frequentist approach, where no optimal empirical decision rule was possible, the Bayesian approach readily admits one: f̂^Ba(x, k) = sign( Σ_{i=1}^n ŵ^Ba_i x_i ), where\n\nŵ^Ba_i = log( (α_i + k_i) / (β_i + m_i − k_i) ). (29)\n\nNotice that for 0 < p_i < 1, we have ŵ^Ba_i → w_i as m_i → ∞, almost surely, both in the frequentist and the Bayesian interpretations.
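The Bayesian weights (29) are immediate to compute from the prior parameters and the committee profile. A minimal sketch (the function name and all numerical values are ours, invented for illustration):

```python
from math import log

def bayes_weights(alpha, beta, m, k):
    # w^Ba_i = log((alpha_i + k_i) / (beta_i + m_i - k_i)), eq. (29):
    # the posterior log-odds of expert i being correct under a
    # Beta(alpha_i, beta_i) prior, after k_i correct answers in m_i queries.
    return [log((a + ki) / (b + mi - ki))
            for a, b, mi, ki in zip(alpha, beta, m, k)]

# Three experts with uniform Beta(1, 1) priors, each queried 20 times.
alpha, beta = [1, 1, 1], [1, 1, 1]
m, k = [20, 20, 20], [18, 12, 9]    # hypothetical committee profile
w = bayes_weights(alpha, beta, m, k)
print([round(wi, 3) for wi in w])
```

As the convergence remark above suggests, feeding in a large m_i with k_i/m_i ≈ p_i drives ŵ^Ba_i toward the Nitzan-Paroush weight log(p_i/(1 − p_i)); the third expert, correct less than half the time, duly receives a negative weight.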
Unfortunately, although P(f̂^Ba(X, K) ≠ Y) = P(ŵ^Ba · η ≤ 0) is a deterministic function of {α_i, β_i, m_i}, we are unable to compute it at this point, or even to give a non-trivial bound. The main source of difficulty is the coupling between ŵ^Ba and η.\n\nOpen problem. Give a non-trivial estimate for P(f̂^Ba(X, K) ≠ Y).\n\n6 Discussion\n\nThe classic and seemingly well-understood problem of the consistency of weighted majority votes continues to reveal untapped depth and to suggest challenging unresolved questions. We hope that the results and open problems presented here will stimulate future research.\n\nReferences\n\n[1] J.A.N. de Caritat, marquis de Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. AMS Chelsea Publishing Series. Chelsea Publishing Company, 1785.\n[2] S. Nitzan, J. Paroush. Optimal decision rules in uncertain dichotomous choice situations. International Economic Review, 23(2):289-297, 1982.\n[3] T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2009.\n[4] J. Neyman, E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. Royal Soc. A: Math., Phys. Eng. Sci., 231(694-706):289-337, 1933.\n[5] N. Littlestone, M. K. Warmuth. The weighted majority algorithm. In FOCS, 1989.\n[6] N. Littlestone, M. K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212-261, 1994.\n[7] N. Cesa-Bianchi, G. Lugosi. Prediction, Learning, and Games. 2006.\n[8] Y. Freund, R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119-139, 1997.\n[9] R. E. Schapire, Y. Freund. Boosting: Foundations and Algorithms. 2012.\n[10] A. P. Dawid and A. M. Skene.
Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20-28, 1979.\n[11] F. Parisi, F. Strino, B. Nadler, Y. Kluger. Ranking and combining multiple predictors without labeled data. Proc. Nat. Acad. Sci., 2014+.\n[12] H. Li, B. Yu, D. Zhou. Error rate bounds in crowdsourcing models. CoRR, abs/1307.2674, 2013.\n[13] C. Gao, D. Zhou. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels (arXiv:1310.5764), 2014.\n[14] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, N. Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, 2006.\n[15] F. Laviolette, M. Marchand. PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. JMLR, 8:1461-1487, 2007.\n[16] J.-F. Roy, F. Laviolette, M. Marchand. From PAC-Bayes bounds to quadratic programs for majority votes. In ICML, 2011.\n[17] Y. Mansour, A. Rubinstein, M. Tennenholtz. Robust aggregation of experts signals. Preprint, 2013.\n[18] E. Eban, E. Mezuman, A. Globerson. Discrete Chebyshev classifiers. In ICML (2), 2014.\n[19] D. Berend, J. Paroush. When is Condorcet's jury theorem valid? Soc. Choice Welfare, 15(4):481-488, 1998.\n[20] P. J. Boland, F. Proschan, Y. L. Tong. Modelling dependence in simple and indirect majority systems. J. Appl. Probab., 26(1):81-88, 1989.\n[21] D. Berend, L. Sapir. Monotonicity in Condorcet's jury theorem with dependent voters. Social Choice and Welfare, 28(3):507-528, 2007.\n[22] D. P. Helmbold, P. M. Long. On the necessity of irrelevant variables. JMLR, 13:2145-2170, 2012.\n[23] D. A. McAllester, L. E. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. JMLR, 4:895-911, 2003.\n[24] M. J. Kearns, L. K. Saul. Large deviation methods for approximate probabilistic inference.
In UAI, 1998.\n[25] D. Berend, A. Kontorovich. On the concentration of the missing mass. Electron. Commun. Probab., 18(3):1-7, 2013.\n[26] M. Raginsky, I. Sason. Concentration of measure inequalities in information theory, communications and coding. Foundations and Trends in Communications and Information Theory, 10(1-2):1-247, 2013.\n[27] D. Berend, A. Kontorovich. A finite-sample analysis of the naive Bayes classifier. Preprint, 2014.\n[28] E. Baharad, J. Goldberger, M. Koppel, S. Nitzan. Distilling the wisdom of crowds: weighted aggregation of decisions on multiple issues. Autonomous Agents and Multi-Agent Systems, 22(1):31-42, 2011.\n[29] E. Baharad, J. Goldberger, M. Koppel, S. Nitzan. Beyond Condorcet: optimal aggregation rules using voting records. Theory and Decision, 72(1):113-130, 2012.\n[30] J.-Y. Audibert, R. Munos, C. Szepesvári. Tuning bandit algorithms in stochastic environments. In ALT, 2007.\n[31] V. Mnih, C. Szepesvári, J.-Y. Audibert. Empirical Bernstein stopping. In ICML, 2008.\n[32] A. Maurer, M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.\n[33] A. Kontorovich. Obtaining measure concentration from Markov contraction. Markov Proc. Rel. Fields, 4:613-638, 2012.\n[34] D. Berend, A. Kontorovich. A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254-1259, 2013.", "award": [], "sourceid": 1801, "authors": [{"given_name": "Daniel", "family_name": "Berend", "institution": "Ben Gurion University"}, {"given_name": "Aryeh", "family_name": "Kontorovich", "institution": "Ben Gurion University"}]}