{"title": "A Bayesian Theory of Conformity in Collective Decision Making", "book": "Advances in Neural Information Processing Systems", "page_first": 9702, "page_last": 9711, "abstract": "In collective decision making, members of a group need to coordinate their actions in order to achieve a desirable outcome. When there is no direct communication between group members, one should decide based on inferring others' intentions from their actions. The inference of others' intentions is called \"theory of mind\" and can involve different levels of reasoning, from a single inference on a hidden variable to considering others partially or fully optimal and reasoning about their actions conditioned on one's own actions (levels of \u201ctheory of mind\u201d). In this paper, we present a new Bayesian theory of collective decision making based on a simple yet most commonly observed behavior: conformity. We show that such a Bayesian framework allows one to achieve any level of theory of mind in collective decision making. The viability of our framework is demonstrated on two different experiments, a consensus task with 120 subjects and a volunteer's dilemma task with 29 subjects, each with multiple conditions.", "full_text": "A Bayesian Theory of Conformity in Collective\n\nDecision Making\n\nKoosha Khalvati\n\nPaul G. Allen School of CSE\n\nUniversity of Washington\n\nkoosha@cs.washington.edu\n\nSaghar Mirbagheri\n\nDepartment of Psychology\n\nNew York University\nsm7369@nyu.edu\n\nSeongmin A. Park\n\nCenter for Mind and Brain\n\nUniversity of California, Davis\n\npark@isc.cnrs.fr\n\nJean-Claude Dreher\nNeuroeconomics Lab\n\nInstitut des Sciences Cognitives Marc Jeannerod\n\ndreher@isc.cnrs.fr\n\nRajesh P. N. Rao\n\nPaul G. Allen School of CSE &\n\nCenter for Neurotechnology\nUniversity of Washington\nrao@cs.washington.edu\n\nAbstract\n\nIn collective decision making, members of a group need to coordinate their actions\nin order to achieve a desirable outcome. 
When there is no direct communication between group members, one must decide based on inferring others\u2019 intentions from their actions. The inference of others\u2019 intentions is called "theory of mind" and can involve different levels of reasoning, from a single inference of a hidden variable to considering others partially or fully optimal and reasoning about their actions conditioned on one\u2019s own actions (levels of \u201ctheory of mind\u201d). In this paper, we present a new Bayesian theory of collective decision making based on a simple yet most commonly observed behavior: conformity. We show that such a Bayesian framework allows one to achieve any level of theory of mind in collective decision making. The viability of our framework is demonstrated on two different experiments, a consensus task with 120 subjects and a volunteer\u2019s dilemma task with 29 subjects, each with multiple conditions.\n\n1 Introduction\n\nCollective decision making is critical for survival in animals that forage as a group [1]. Even though humans are not "hunter-gatherers" any more, collective decision making has remained a crucial element of modern human society, as exemplified by the practice of trial by jury [2, 3]. Group decision making can become extremely challenging when there is no communication between group members, such as in tasks requiring anonymous consensus or volunteers. In these situations, players have to infer the intentions of others from their actions before making their own decisions.\nConformity, or aligning one\u2019s actions with other group members, is a behavior that has been widely observed in group decision making by biologists and psychologists [4, 5, 6], for example, in developing social norms [7, 8]. In fact, even in competitive situations, humans may mimic their opponent\u2019s behavior unintentionally [9]. 
In a collective decision making task, by definition, at least some amount of cooperation between different group members is required for producing utility. Conformity provides a mechanism for cooperation. However, in many situations, there is also some amount of competition between group members, making them cooperate strategically. For example, different players might prefer different outcomes. In these cases, additional processes, such as prediction of the effect of one\u2019s action on others, are required for utility maximization.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe future state of mind of other group members, and consequently their actions, can be linked to one\u2019s own current action if the group utilizes conformity. This link can be used by a player to select actions that produce maximum total utility. The ability to infer others\u2019 intentions, known as theory of mind (ToM), has been observed in humans extensively [10]. Humans are believed to be able to infer others\u2019 current and even future states of mind during social interactions [11]. While some studies suggest a connection between theory of mind and conformity in social decision making, e.g., the opportunity for reciprocity in multi-round games [12, 13], there is no mathematical framework or quantitative analysis demonstrating this connection.\nHere we present a Bayesian framework based on conformity and theory of mind for expected reward maximization in collective decision making. First, we present a Bayesian model of conformity as the basis for a framework for collective decision making. Then, we show how a (meta-)Bayesian agent can make better decisions (in terms of total utility gain) in complicated tasks in which there is competition between group members, by reasoning about the belief of other Bayesian agents that utilize conformity. 
In addition, we show that this framework can be extended to model different "levels of theory of mind" in collective decision making. Following the terms used in the literature on two-person interactions [14, 15], a Bayesian agent in our framework that utilizes conformity has level-0 ToM, a (meta-)Bayesian agent that reasons about other Bayesian agents that utilize conformity has level-1 ToM, a (meta-)Bayesian agent that reasons about other (meta-)Bayesian agents that reason about Bayesian agents that utilize conformity has level-2 ToM, and so on. This framework is a generalization of our previous level-1 ToM Bayesian model that explained human behavior in a group decision making task [16] to multiple levels of theory of mind and, consequently, a broader range of collective decision making tasks.\nWe tested our framework on two different collective decision making experiments involving human subjects: a consensus task and a volunteer\u2019s dilemma task. Our normative Bayesian framework explained and predicted human behavior well on both of these tasks. Moreover, the level of theory of mind that the subjects utilized in the experiments was aligned with the components of the task and the information available, such as the utility function of others. We also present an analysis of higher "levels of theory of mind" not observed in the experiments and show convergence to a Nash equilibrium in collective decision making [17].\n\n2 Problem Definition and the Bayesian framework\n\nWe investigate the problem of collective decision making with N > 2 players and multiple rounds. In each round, the players choose their actions simultaneously and then all actions in that round are shown to all players, anonymously. The same set of actions is available to each player. More importantly, the reward (utility) of each player in each round depends only on their own action and how many of each of the available actions were selected by others in that round. 
For simplicity, we assume that the number of possible actions is 2, i.e., the set of actions A = {a1, a2}.\n\n2.1 Bayesian Conformity: Matching the Group\n\nConformity is matching the behavior of the whole group, i.e., if the probability of choosing action a1 by an average group member is \u03b8, a player using conformity as their strategy should choose a1 with probability \u03b8 as well. This probability is not observable to the player, and each round only provides indirect information about this latent variable via the actions of all N players. Specifically, if the probability of choosing a1 by an average group member is \u03b8, the likelihood of observing m a1 actions in total from the other N \u2212 1 players in a round is given by the binomial probability mass function:\n\nP(m|\u03b8) = C(N \u2212 1, m) \u03b8^m (1 \u2212 \u03b8)^{N\u22121\u2212m},   (1)\n\nwhere C(N \u2212 1, m) denotes the binomial coefficient. As \u03b8 is not observable, we assume the player maintains a "belief" about \u03b8, i.e., they maintain a probability distribution over \u03b8. Because the observations given \u03b8 are binomially distributed, we express the belief of the player about \u03b8 with a Beta distribution, which is determined by two parameters \u03b1 and \u03b2: Beta(\u03b1, \u03b2) : P(\u03b8|\u03b1, \u03b2) \u221d \u03b8^{\u03b1\u22121}(1 \u2212 \u03b8)^{\u03b2\u22121}, with the divisive normalizing constant B(\u03b1, \u03b2) = \u222b_0^1 \u03b8^{\u03b1\u22121}(1 \u2212 \u03b8)^{\u03b2\u22121} d\u03b8.\n\nGiven the above equations, the probability of observing m a1 actions by other players is given by:\n\nP(m|\u03b1, \u03b2) = \u222b_0^1 P(m|\u03b8) P(\u03b8|\u03b1, \u03b2) d\u03b8 = C(N \u2212 1, m) B(\u03b1 + m, \u03b2 + N \u2212 1 \u2212 m) / B(\u03b1, \u03b2).   (2)\n\nThe player can update their belief about \u03b8 after each round of the game using Bayes\u2019 rule, once the N \u2212 1 actions of the others are observed. 
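As an illustration, the Beta\u2013binomial predictive of Equation 2 and the conjugate posterior update can be sketched in Python. This is a minimal sketch with illustrative names (not the authors' code); here m counts only the other N \u2212 1 players' a1 actions.

```python
# Sketch of the Beta-binomial belief over the group's action probability
# theta (Equations 1-2). Function names are illustrative, not from the paper.
from math import comb, lgamma, exp

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive(m, alpha, beta, n_others):
    # P(m | alpha, beta): probability that m of the other N-1 players
    # choose a1 (Equation 2), marginalizing theta ~ Beta(alpha, beta).
    return comb(n_others, m) * exp(
        log_beta(alpha + m, beta + n_others - m) - log_beta(alpha, beta)
    )

def posterior(alpha, beta, m, n_others):
    # Conjugate update after observing m a1-actions among the others
    # (the player's own action can be counted in the same way).
    return alpha + m, beta + n_others - m

# Usage: N = 6 players, uniform prior Beta(1, 1) over theta.
probs = [predictive(m, 1.0, 1.0, 5) for m in range(6)]
```

With a uniform Beta(1, 1) prior, the predictive over m is itself uniform, a standard Beta-binomial sanity check.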
Also, the player considers their own action in the update as well, as they are a member of the group. The posterior probability of \u03b8 after observing m actions of a1 from all N players is Beta(\u03b1 + m, \u03b2 + N \u2212 m) [18] (the Beta distribution is the conjugate prior for the Bernoulli distribution).\nAs the task has multiple rounds, the player starts with a prior probability of Beta(\u03b10, \u03b20), updates it after each round, and uses the posterior probability of \u03b8 as the prior of the next round. This is a Hidden Markov Model (HMM) in which the player infers the state of mind of the group about the next action by observing the previous actions [18].\nDue to possible changes in other group members\u2019 strategies, the most recent observations are often more reliable and thus deserve a larger weight in the inference. We model this by using a decay rate 0 \u2264 \u03bb \u2264 1 for the prior. This means that a prior of Beta(\u03b1, \u03b2) with m out of N actions being a1 in the current round results in a posterior of Beta(\u03bb\u03b1 + m, \u03bb\u03b2 + N \u2212 m). For example, \u03bb = 0 means that only the most recent round determines the posterior, and \u03bb = 1 considers all previous rounds equally important.\nAt the beginning of each round, the player chooses an action. According to the principle of conformity, a1 should be chosen with probability \u03b8. As the player has a posterior probability over \u03b8 instead of the exact value, they use the expected value of \u03b8, which is \u03b1/(\u03b1 + \u03b2) [18]. We model this scenario with an HMM (instead of a decision process; see below) as the player considers neither the effect of their own action on others nor a reward function.\nIn summary, a player that utilizes Bayesian conformity has a prior belief of Beta(\u03b10, \u03b20) over \u03b8 before the start of a multi-round game, with a decay rate \u03bb. 
At round t \u2265 0, the player chooses a1 with probability \u03b1t/(\u03b1t + \u03b2t). Then, after observing everyone\u2019s actions in that round, if there are mt a1 actions in total, the belief changes to Beta(\u03b1t+1, \u03b2t+1), where \u03b1t+1 = \u03bb\u03b1t + mt and \u03b2t+1 = \u03bb\u03b2t + N \u2212 mt.\n\n2.2 Meta-Bayesian Conformity: Influencing the Group\n\nThe model in the previous section ignored the fact that the player\u2019s actions can potentially influence the actions of the group on average. If the group members also utilize conformity, one might be able to influence their future actions, particularly if \u03b1 and \u03b2 are small, the decay rate is large, or the values of \u03b1 and \u03b2 are close to each other (i.e., \u03b8 is around .5). As a result, a player who takes advantage of this knowledge can increase their total expected reward over the course of the game by leading the group to states that are more rewarding in the later rounds of the game. The idea of selecting actions that maximize the total expected reward transforms the model from one based on HMMs (previous section) to a Partially Observable Markov Decision Process (POMDP) [19].\nFormally, a POMDP is defined as a tuple (S, A, T, Z, O, R, \u03b3) where S is the set of states, A is the set of possible actions, and T is the Markov transition function determining the next state s given the current state s\u2032 and action a, i.e., P(s|s\u2032, a). Z is the set of possible observations and O is the observation function that determines the probability of an observation given a state: P(z|s). The reward function R determines the reward (utility) received by the player in the current round. Finally, 0 \u2264 \u03b3 \u2264 1 is the discount factor for future rewards. 
Starting from an initial belief b0 (prior probability), and similar to the Bayesian update in an HMM, the POMDP model updates its belief bt at round t based on its action and the observation (actions of other players) in that round. The goal of the POMDP is to find a policy mapping beliefs to actions that maximizes the total (expected) discounted reward, i.e., \u2211_{t=0}^{\u221e} \u03b3^t rt.\nFinding the optimal policy for a POMDP (solving the POMDP) is computationally very expensive. While recently developed methods can estimate the solution reasonably well [20], as in our framework the belief can be captured with a few parameters (\u03b1 and \u03b2), we solve the POMDP by casting it as a Markov Decision Process (MDP) whose state space is the POMDP model\u2019s belief state space [21]. We define our new MDP, (S, A, T, R, \u03b3), as follows. The state space is the space of (\u03b1, \u03b2) determining the subject\u2019s belief after each round of the game. The actions remain the same as before (A = {a1, a2}). As in the previous section, the transition function is defined by the probability of observing the actions of others (Equation 2), the player\u2019s own action, and the decay rate \u03bb:\n\nT: P((\u03bb\u03b1 + m + 1, \u03bb\u03b2 + N \u2212 1 \u2212 m) | (\u03b1, \u03b2), a1) = C(N \u2212 1, m) B(\u03bb\u03b1 + m, \u03bb\u03b2 + N \u2212 1 \u2212 m) / B(\u03bb\u03b1, \u03bb\u03b2)\n    P((\u03bb\u03b1 + m, \u03bb\u03b2 + N \u2212 m) | (\u03b1, \u03b2), a2) = C(N \u2212 1, m) B(\u03bb\u03b1 + m, \u03bb\u03b2 + N \u2212 1 \u2212 m) / B(\u03bb\u03b1, \u03bb\u03b2)   (3)\n\nAs specified in the problem definition, the reward of each player in each round only depends on their own action at and the actions of the other players in that round. Thus, the reward function when A = {a1, a2} is a function of mt, the number of a1 actions by others, and at: rt = R(mt, at). 
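The belief-state transition of Equation 3 can be sketched as follows (a minimal illustration with names of our choosing, not the authors' implementation): given a belief (\u03b1, \u03b2), the player's own action, and the decay rate \u03bb, enumerate the successor belief states with their probabilities.

```python
# Sketch of the belief-MDP transition function (Equation 3).
from math import comb, lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def transitions(alpha, beta, action_a1, n_players, lam):
    # Returns [(next_state, probability), ...] over the number m of a1
    # actions chosen by the other N-1 players, with decay lam applied
    # to the prior as in Equation 3.
    n_others = n_players - 1
    a, b = lam * alpha, lam * beta
    out = []
    for m in range(n_others + 1):
        p = comb(n_others, m) * exp(
            log_beta(a + m, b + n_others - m) - log_beta(a, b)
        )
        if action_a1:
            nxt = (a + m + 1, b + n_others - m)       # own a1 counted
        else:
            nxt = (a + m, b + n_others - m + 1)       # own a2 counted
        out.append((nxt, p))
    return out
```

For N = 6, \u03bb = 1, and a uniform belief, the successor probabilities sum to one, as they must for a valid transition kernel.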
Since observations are not modelled in MDPs, we calculate the reward for our MDP\u2019s state (which is the belief state of the original POMDP) and action as:\n\nR((\u03b1, \u03b2), a) = \u2211_{m=0}^{N\u22121} C(N \u2212 1, m) [B(\u03b1 + m, \u03b2 + N \u2212 1 \u2212 m) / B(\u03b1, \u03b2)] R(m, a).   (4)\n\nWith the above definition of the components of the MDP, the optimal policy can be found easily using Bellman\u2019s equation and dynamic programming [21]. This optimal policy, \u03c0*_t, is a function from the MDP\u2019s state space to the action space for 0 \u2264 t < H, where H is the maximum number of possible rounds (the horizon).\n\n2.3 Higher Levels of Theory of Mind\n\nA Bayesian agent that utilizes conformity only infers the state of mind of others, without assuming others have the same inference capability as well (level-0 ToM). The POMDP described above assumes others infer the state of mind of group members as well (level-1 ToM), to the extent of matching their probability of actions. We extend this reasoning here to achieve higher levels of theory of mind. A level-k ToM agent is a POMDP agent that assumes others have level-(k \u2212 1) ToM. The Interactive-POMDP (I-POMDP) model is a general framework for modeling other POMDP agents with arbitrary transition, observation, and reward functions [22]. If the rules of the task are conveyed to all players, achieving higher levels of ToM becomes more computationally tractable, as the agent uses the same transition and observation functions for all members. A significant practical problem, however, is that the reward function of others is often not known. 
As a result, modeling higher levels of ToM becomes more plausible when the reward function is (at least mostly) similar for all group members, e.g., in the case of monetary rewards.\nIf the player uses a common reward function Ro for others in the group, the level-k ToM agent (k > 1) is modeled as a POMDP with state space S = (\u03b1, \u03b2, t), where t is the round number (0 \u2264 t < H). The action space remains the same: A = {a1, a2}. The transition function becomes deterministic, as follows. Let the current state be (\u03b1, \u03b2, t). If \u03c0*_{k\u22121,t} = a1, where \u03c0*_{k\u22121,t} is the policy of the level-(k \u2212 1) POMDP with the reward function Ro, the next state is (\u03bb\u03b1 + N, \u03bb\u03b2, t + 1) for action a1 and (\u03bb\u03b1 + N \u2212 1, \u03bb\u03b2 + 1, t + 1) for action a2. Similarly, if \u03c0*_{k\u22121,t} = a2, the next state is (\u03bb\u03b1 + 1, \u03bb\u03b2 + N \u2212 1, t + 1) for the action a1, and (\u03bb\u03b1, \u03bb\u03b2 + N, t + 1) for a2. Along the same lines, the reward function is:\n\nR((\u03b1, \u03b2, t), a) = R((N \u2212 1) I(\u03c0*_{k\u22121,t} = a1), a),   (5)\n\nwhere I(x) is 1 if event x happens and 0 otherwise. Note that the assumed reward function of others, Ro, could be different from the reward function R of the player, but in practice, if there is one reward function for others, it probably applies to all group members including the player.\nIf the player does not know the reward function of others, they could estimate it based on the dynamics of the game. When a player uses higher than level-0 ToM, they know their actions could lead the group towards selecting actions that produce more reward for themselves. As a result, when a level-1 or higher ToM player chooses an action, either the immediate reward for that action is higher for them, or, due to the state of the game, that action produces more expected reward despite producing less immediate reward. 
In the latter case, the chosen action would not change if one assumes a higher reward for it. As a result, for level-k ToM (k > 1), when the state is (\u03b1, \u03b2), the player can divide the other players into two groups based on their "preference" (immediate reward) for an action and estimate the reward function of each group separately:\n\nR((\u03b1, \u03b2, t), a) = R( [(N \u2212 1)\u03b1/(\u03b1 + \u03b2)] I(\u03c01*_{k\u22121,t} = a1) + [(N \u2212 1)\u03b2/(\u03b1 + \u03b2)] I(\u03c02*_{k\u22121,t} = a1), a ).   (6)\n\nSimilar to the common reward function Ro case above, we define \u03c01*_{k\u22121,t} and \u03c02*_{k\u22121,t} as policies of the level-(k \u2212 1) POMDP model, using the reward function Ro1 in which action a1 has the higher immediate reward, and Ro2 in which a2 is the more rewarding action.\n\n3 Experimental Results\n\nWe tested our framework on the human behavioral data from two different collective decision making experiments. The first was a consensus group decision making task [23] where N = 6 or N = 4 players need to agree on one among several items presented to them within a limited number of rounds. The second experiment was a multiple-round Volunteer\u2019s Dilemma task [24] where N = 5 players had to each decide whether or not to contribute to a pot of monetary units for the public good. In each round of each game, every group member got an extra reward if and only if at least k players had agreed or contributed. In both of these experiments, each subject played a multiple-round game several times. The set of players did not change during each game. We fit models based on conformity with different levels of ToM to the behavior of each subject and compared the accuracy of the different-level models in explaining human behavior. 
Specifically, in our model fitting, we used a set of free parameters that explains all games of each subject across all the different conditions. Thus, different conditions are modelled naturally in our framework without any parameter tuning for each condition. The accuracy of a model was determined by the similarity between the model\u2019s predicted action and the actual action of the subject in each round, on average. In other words, if the predicted action of a model was \u00e2 and the real action was a in a round, the average error was the average of the binary error |\u00e2 \u2212 a| over all rounds of all games for the subject (accuracy = 1 \u2212 average error). For the level-0 ToM model, similar to classification methods, and to produce results comparable with the higher levels of ToM, the selected action was a1 when \u03b1/(\u03b1 + \u03b2) was more than .5, and a2 otherwise.\nWe also computed the overall accuracy (all levels of ToM combined) for our framework in each task for each subject. In addition to fitting accuracy, we calculated Leave-One-Out Cross-Validation (LOOCV) accuracy, where at each iteration the left-out data point was one whole game. We compared both the fitting and LOOCV accuracy of a reinforcement-based model-free approach to those of our framework [25, 26]. In this approach, the player chooses the most rewarding action according to rewards in previous rounds: the agent starts the task with an initial value for each action, chooses the action with the maximum value in each round, and updates that action\u2019s value based on the gained reward in that round with a weight called the learning rate [27]. Details of the method are in the supplementary material.\n\n3.1 Consensus Decision Making\n\nWe analyzed the behavioral data of [23] with 120 subjects performing consensus decision making in groups of N = 6 or N = 4 members. Each subject played the game 40 times (20 with 5 other players, 20 with 3 other players). 
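The accuracy metric described above can be sketched as follows (an illustrative reading with names of our choosing, not the authors' code): the level-0 model predicts a1 when \u03b1/(\u03b1 + \u03b2) > .5, and accuracy is one minus the mean binary error against the subject's actual choices.

```python
# Sketch of the fitting-accuracy metric: binary prediction error averaged
# over all rounds of all games for a subject.

def level0_prediction(alpha, beta):
    # 1 encodes a1, 0 encodes a2
    return 1 if alpha / (alpha + beta) > 0.5 else 0

def accuracy(predicted, actual):
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return 1.0 - sum(errors) / len(errors)

# Usage on a toy sequence of belief states and actual choices:
preds = [level0_prediction(a, b) for a, b in [(3, 1), (2, 5), (4, 4)]]
acc = accuracy(preds, [1, 0, 0])
```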
Each game contained multiple rounds; the number of rounds depended on when the players reached an agreement. Each game started with the presentation of two options, e.g., a piece of chocolate and an apple. The subjects had to choose one of them in each round. The game ended when all players chose the same option. Options remained the same throughout the whole game. After each round, each player observed the others\u2019 selected actions as red dots under each option (individual actions were anonymous). After the 10th round, if consensus had still not been reached, the game ended with a probability of 25% in each subsequent round. If the players reached a unanimous consensus, they all received the chosen option. If the game ended without a consensus, subjects did not receive anything. Before the experiment, each subject\u2019s preference or value for each item (between $0 and $4) was determined through a Becker-DeGroot-Marschak (BDM) auction [28]. More details of the task can be found in the original article describing this experiment [23].\n\n3.1.1 Model Fitting and Predictions\n\nWe analyzed the games that lasted more than 1 round for each subject. As each player did not know the values of the other group members for each item, we used the actions from the first round as the prior for the rest of the game, i.e., \u03b11 = m0 and \u03b21 = N \u2212 m0 (equivalent to considering no prior knowledge at the beginning), where m0 is the number of players that chose option 1 in the first round. As a result, for the conformity (level-0) model, there was only one free parameter, the decay rate \u03bb.\nFor the POMDP model (level-1), we used the same approach for the prior belief as for the level-0 model above. The reward for each option for each player was set to the value of that item determined in the auction process before the experiment, plus a constant value as the reward for winning. 
This winning value was a free parameter but was kept the same for all available items for each subject. Moreover, it was constrained to be between $0 and $10 to be comparable with the estimated values of the items. Overall, the level-1 model had two free parameters (decay rate and winning value) for each subject.\nFor the level-2 ToM model, as the subjects did not know the others\u2019 values for each item, we used the second approach explained in our framework, i.e., using the two reward functions Ro1 and Ro2. We assumed that the subject estimated the value of each action for others as $2.5 for their favourite option (a1 for Ro1, and a2 for Ro2) and $1.5 for the other option, plus the same winning value for all players (similar to the level-1 model). The prior was also computed in the same manner as at previous levels. Therefore, the free parameters of the level-2 (and higher) models are the same as those of the level-1 model. More details of model fitting are presented in the supplementary material.\nAs shown in Figure 1a, the behavior of 83 of the 120 subjects in the experiment was better explained by the level-0 model than by the level-1 model, in terms of higher fitting accuracy. Of the remaining 37 subjects, the behavior of 30 was better explained by the level-1 model by a margin of more than 5%, with level-1 having one more free parameter. Average fitting accuracy was 77% and 72% for the level-0 and level-1 models, respectively. The level-2 model had a lower fitting accuracy than the level-1 model for almost all subjects (Figure 1b). Therefore, we used only the first two levels of ToM for further analysis. A model with a mix of level-0 and level-1, i.e., choosing the more accurate level for each subject, resulted in 80% accuracy (SD = 0.07) (Figure 1c). 
The average LOOCV accuracy of this mixed model was 75% (SD = 0.12).\nWe also fit the model-free reinforcement learning model, with three free parameters, to the subjects\u2019 behavior. One of these free parameters was the winning value described above, which was added to the value estimated for each item, the same as in our ToM models above. For the learning rate, we used the function 1/(\u03ba0 + \u03ba1t), with \u03ba0 and \u03ba1 as the free parameters, to model the dependency of the learning rate on the round number t. The average fitting accuracy of this model was 62% (SD = 0.13), significantly worse than the fitting accuracy of both the level-0 and level-1 ToM models in our framework (two-tailed paired t-test; compared to level-0: t(119) = 10.64, p < 0.001; compared to level-1: t(119) = 8.08, p < 0.001), and thus also significantly worse than the mix of these two ToM models (Figure 1d). In addition, the average LOOCV accuracy of the model-free reinforcement learning model was 49% (SD = 0.12), significantly worse than the LOOCV accuracy of both the level-0 and level-1 ToM models (two-tailed paired t-test; compared to level-0: t(119) = 15.57, p < 0.001; compared to level-1: t(119) = 5.98, p < 0.001), and than the mix of the two models (two-tailed paired t-test: t(119) = 15.71, p < 0.001).\nWhen fit to a subject\u2019s behavior alone, our framework can predict whether or not the subject thinks the game will end in the next round, using Equation 2 in each state. This prediction cannot be fully validated, as we do not have access to the subject\u2019s real belief. However, we compared the model\u2019s prediction to outcomes in the actual games. 
Our model based on a mix of level-0 and level-1 ToM predicted whether or not the game would end in each round with 76% average accuracy.\n\nFigure 1: Different levels of ToM in consensus decision making. (a) Comparison of level-0 and level-1 ToM fitting accuracy for each subject. The level-0 model explained a greater proportion of the subjects\u2019 behavior (see text for details). (b) Comparison of level-1 and level-2 ToM fitting accuracy for each subject. Level-1 explained the subjects\u2019 behavior better for almost all of the subjects. (c) Mixing the first two levels of ToM (picking the level with higher accuracy for each subject) increased the accuracy of explaining the subjects\u2019 behavior. The green line shows the mean and the red line represents the median in each box plot. (d) Our framework outperformed the reinforcement-based model-free approach for almost all subjects.\n\n3.2 Volunteer\u2019s Dilemma\n\nWe analyzed the behavioral data from a Volunteer\u2019s Dilemma task [24] where 29 subjects played 12 games of a multi-round thresholded Public Goods Game (PGG). Each game involved the subject playing 15 rounds within the same group of N players (N = 5). At the beginning of each round, 1 monetary unit (MU) was given to each player. Each player could choose between two options: giving up the 1 MU (contribute) or keeping it (free-ride). If at least k players contributed ("volunteered"), all players got another 2 MUs (public good) in that round. Otherwise, no reward was given. As a result, if the public good was produced (the round was a success), the contributors ended up with 2 MUs in that round while the free-riders had 3 MUs. Similarly, in the case of failure in producing the public good, contributors ended up with 0, as they gave up their 1 MU, while free-riders had 1 MU.\nSimilar to the consensus experiment, players observed everyone\u2019s actions anonymously at the end of each round. 
The required number of volunteers, k, was chosen randomly between k = 2 and k = 4 with equal probability at the beginning of each game. This number was conveyed to the subjects and remained the same throughout the whole game. While the subjects thought they were playing with other human players, in contrast to the consensus task, they were playing with an algorithm (based on previous human data) that generated the actions of the other players. More details can be found in [24].\n\n3.2.1 Model Fitting and Predictions\n\nIn this experiment, the reward function for all players is the same because monetary reward was used (rather than items of different desirability). As a result, players might use a prior based on their previous experience in life or fictitious play, even before the start of the game. For the models at all levels of ToM, there are 3 free parameters in total, i.e., \u03bb, \u03b10, and \u03b20. As seen in Figures 2a and 2b, the level-1 model\u2019s fitting accuracy was higher than that of the level-0 and level-2 models. There is also a strong correlation between the accuracy of different levels, due to the fact that games with fewer changes, i.e., consistent contributions or consistent free-rides by the subject, make the fit better for all methods.\nDue to the higher accuracy of the level-1 ToM model, we compared the model-free reinforcement learning (RL) model only to the level-1 model. The RL model had 5 parameters in total. The first parameter was a reward for generating the public good, which was added to the monetary reward. The next two parameters determined the chance of producing the public good for k = 2 versus k = 4, and were used to define the initial Q-value of each action. The final two free parameters determined the learning rate, similar to the consensus task (more details in the supplementary material). 
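The model-free baseline just described can be sketched as follows. This is our reading of the description above (greedy action values with a round-dependent learning rate 1/(\u03ba0 + \u03ba1 t)), with illustrative names, not the authors' implementation.

```python
# Sketch of the model-free RL baseline: Q-values per action, greedy
# choice, and a decaying learning rate 1/(kappa0 + kappa1 * t).

def run_model_free(rewards, q_init, kappa0, kappa1):
    # rewards[t][a] is the (hypothetical) reward available at round t for
    # action a; returns the sequence of greedy choices.
    q = list(q_init)
    choices = []
    for t, r in enumerate(rewards):
        a = 0 if q[0] >= q[1] else 1       # greedy action
        choices.append(a)
        lr = 1.0 / (kappa0 + kappa1 * t)   # round-dependent learning rate
        q[a] += lr * (r[a] - q[a])         # move Q toward observed reward
    return choices

# Usage: action 1 always pays more, so the agent switches to it after
# the first disappointing outcome of action 0.
picks = run_model_free([(0.0, 1.0)] * 5, [0.5, 0.4], 1.0, 1.0)
```

In the actual tasks the per-round reward depends on the other players' actions; the fixed reward table here is only for illustration.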
The average fitting accuracy of the level-1 POMDP model was 84% (SD = 0.06), while the average fitting accuracy of the RL model was 79% (SD = 0.07), significantly worse than the level-1 model's fit (two-tailed paired t-test, t(28) = −6.75, p < 0.001). Also, the average LOOCV accuracy of the level-1 POMDP model was 77% (SD = 0.08), significantly higher than the average LOOCV accuracy of the RL model, which was 73% (SD = 0.09) (two-tailed paired t-test, t(28) = 2.20, p = 0.037).
The level-1 model, after fitting, can also predict a subject's belief about the success of the group in generating the public good in each round (similar to predicting the end of the game in the consensus task). The level-1 model predicted the success rate of each round with 73% accuracy on average. Moreover, the pattern of this prediction was similar to the actual data when the games for different conditions (ks) were compared for each subject (Figure 2c).
Higher levels of ToM in our framework assume deeper levels of optimality in terms of a player's own reward maximization. This optimality, also known as rationality in game theory [29], leads to a Nash equilibrium as the depth increases to infinity [22, 17]. We tested this for our Volunteer's Dilemma task, in which everyone free-riding is a Nash equilibrium. Specifically, we fit different levels of ToM to the data and calculated the average contribution rate of the subject predicted by the model.

Figure 2: Different levels of ToM in the volunteer's dilemma task. (a) Comparison of level-0 and level-1 ToM model fitting accuracy for each subject. The level-1 model explained the behavior better for almost all of the subjects. (b) Comparison of level-1 and level-2 ToM fitting accuracy for each subject. Level-1 explained the behavior better for almost all of the subjects. (c) The level-1 ToM model predicted the success rate of producing the public good in each round quite accurately. The pattern of this prediction (blue circles) matched the actual data (orange circles) when the games with different conditions (k) were compared for each subject. (d) As the level of ToM increases, the average contribution rate converges to zero, consistent with the fact that a fully rational agent should free-ride in all rounds of a volunteer's dilemma game with a known number of rounds (free-riding in all rounds is a Nash equilibrium).

As seen in Figure 2d, the predicted contribution rate decreased gradually to 0 as the level of ToM increased, despite being fit to a dataset with a contribution rate significantly higher than 0. A player whose only goal is maximizing their own reward does not contribute in the last round, as there are no future rounds. In fact, consistent with the principle of conformity, the most important effect of contribution is increasing the contribution rate of others so as to produce more reward in the future. As the level of ToM increases, the player free-rides in earlier rounds because others (modeled as optimal agents) would be expected to free-ride in later rounds. Thus, higher levels of ToM shift free-riding towards the first round, decreasing the contribution rate over all rounds to 0 (Figure 2d).

4 Discussion

We presented a new Bayesian framework for modeling conformity and theory of mind in collective decision making. To our knowledge, this framework is the first multi-level ToM model for group decision making. Previous models covered only two-person interactions [14, 30, 15, 17, 31] or only a single level of ToM [24, 32]. We demonstrated the viability of our framework using data from two different experiments, each with different conditions.
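The backward-induction argument above (a purely self-interested player free-rides in the last round when others are expected to free-ride, which then propagates to earlier rounds) can be sketched with the PGG payoffs already described. This is an illustrative toy for the last round only, not the paper's POMDP model; the helper names are ours:

```python
def round_payoff(contributed, n_contributors, k, endowment=1, public_good=2):
    """One-round PGG payoff: free-riders keep 1 MU; all get 2 MUs if threshold met."""
    kept = 0 if contributed else endowment
    return kept + (public_good if n_contributors >= k else 0)

def best_response_last_round(others_contributing, k):
    """In the last round there is no future to invest in, so a purely
    self-interested player just compares the two immediate payoffs."""
    pay_contribute = round_payoff(True, others_contributing + 1, k)
    pay_free_ride = round_payoff(False, others_contributing, k)
    return "contribute" if pay_contribute > pay_free_ride else "free_ride"

# If everyone else free-rides, a unilateral contribution cannot reach the
# threshold (k >= 2), so free-riding is the best response -- which is why
# all-free-ride is a Nash equilibrium:
assert best_response_last_round(0, 2) == "free_ride"
# Contributing only pays when the player is exactly pivotal (2 MUs vs. 1 MU):
assert best_response_last_round(1, 2) == "contribute"
```

Iterating this logic backwards from the final round is what drives the predicted contribution rate in Figure 2d toward zero as the assumed depth of others' rationality grows.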
In addition to experimental fits and predictions, we showed that the levels of ToM that best explain subject behavior are aligned with the conditions of the two tasks. In the consensus task, the behavior of most of the subjects was explained best by a level-0 ToM model. In this task, different choices could be desirable to different subjects, but they knew that all players had to pick the same choice in order to finish the current game and obtain one of the choices. On the other hand, in the volunteer's dilemma task, while the players might want to produce the public good, they knew that having more than the required number of volunteers would be a waste of resources (especially for k = 2). Strategic game play and reasoning about the current and future intentions of others seems more necessary in the Volunteer's Dilemma task. Consistent with this intuition, nearly all subjects were better fit by the level-1 ToM model. A level-2 ToM model could not explain the subjects' behavior in our experiments. This was expected, given the computational cost of deep reasoning and the rarity of higher levels observed in two-person interaction studies in general [15]. Finally, while we illustrated the approach with only two possible actions, the framework can easily be extended to more actions by using multinomial and Dirichlet distributions [18].

Acknowledgments

This work was funded by a Templeton World Charity Foundation grant, CRCNS NIMH grant no. 5R01MH112166-03, NSF grant no. EEC-1028725, and an NSF-ANR Collaborative Research in Computational Neuroscience 'CRCNS SOCIAL POMDP' grant no. 16-NEUC. We thank Shinsuke Suzuki and John P. O'Doherty for providing us their data from the consensus task.

References

[1] Colin W. Clark and Marc Mangel. The evolutionary advantages of group foraging. Theoretical Population Biology, 30(1):45–75, August 1986.

[2] Allen W. Johnson and Timothy K. Earle. The Evolution of Human Societies: From Foraging Group to Agrarian State. Stanford University Press, 2000.

[3] Seongmin A. Park, Sidney Goïame, David A. O'Connor, and Jean-Claude Dreher. Integration of individual and social information for decision-making in groups of different sizes. PLoS Biology, 15(6):e2001958, 2017.

[4] Robert B. Cialdini and Melanie R. Trost. Social influence: Social norms, conformity and compliance. In The Handbook of Social Psychology, volume 2, pages 151–192, 1998.

[5] Erica van de Waal, Christèle Borgeaud, and Andrew Whiten. Potent social learning and conformity shape a wild primate's foraging decisions. Science, 340(6131):483–485, April 2013.

[6] Rachel L. Kendal, Isabelle Coolen, and Kevin N. Laland. The role of conformity in foraging when personal and social information conflict. Behavioral Ecology, 15(2):269–277, March 2004.

[7] M. Sherif. The Psychology of Social Norms. Harper, Oxford, England, 1936.

[8] Andrew Whiten, Victoria Horner, and Frans B. M. de Waal. Conformity to cultural norms of tool use in chimpanzees. Nature, 437(7059):737, September 2005.

[9] Marnix Naber, Maryam Vaziri Pashkam, and Ken Nakayama. Unintended imitation affects success in a competitive game. Proceedings of the National Academy of Sciences, 110(50):20046–20050, 2013.

[10] Simon Baron-Cohen, Alan M. Leslie, and Uta Frith. Does the autistic child have a "theory of mind"? Cognition, 21(1):37–46, October 1985.

[11] Mark A. Thornton, Miriam E. Weaverdyck, and Diana I. Tamir. The social brain automatically predicts others' future mental states. Journal of Neuroscience, 39(1):140–148, 2019.

[12] Ernst Fehr and Urs Fischbacher. The nature of human altruism. Nature, 425(6960):785, 2003.

[13] Attila Szolnoki and Matjaž Perc. Conformity enhances network reciprocity in evolutionary social dilemmas. Journal of The Royal Society Interface, 12(103):20141299, 2015.

[14] Wako Yoshida, Ray J. Dolan, and Karl J. Friston. Game theory of mind. PLOS Computational Biology, 4(12):e1000254, December 2008.

[15] Marie Devaine, Guillaume Hollard, and Jean Daunizeau. Theory of mind: Did evolution fool us? PLOS ONE, 9(2):e87619, February 2014.

[16] Koosha Khalvati, Seongmin A. Park, Saghar Mirbagheri, Remi Philippe, Mariateresa Sestito, Jean-Claude Dreher, and Rajesh P. N. Rao. Modeling other minds: Bayesian inference explains human choices in group decision-making. Science Advances, 5(11):eaax8783, 2019.

[17] Andreas Hula, P. Read Montague, and Peter Dayan. Monte Carlo planning method estimates planning horizons during interactive social exchange. PLoS Computational Biology, 11(6):e1004254, 2015.

[18] K. P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning. MIT Press, 2012.

[19] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[20] Stephane Ross, Joelle Pineau, Sebastien Paquet, and Brahim Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32(1), 2008.

[21] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, Cambridge, MA, 2005.

[22] P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, July 2005.

[23] Shinsuke Suzuki, Ryo Adachi, Simon Dunne, Peter Bossaerts, and John P. O'Doherty. Neural mechanisms underlying human consensus decision-making. Neuron, 86(2):591–602, 2015.

[24] Koosha Khalvati, Seongmin A. Park, Jean-Claude Dreher, and Rajesh P. N. Rao. A probabilistic model of social decision making based on reward maximization. In Advances in Neural Information Processing Systems, pages 2901–2909, 2016.

[25] Patrick H. McAllister. Adaptive approaches to stochastic programming. Annals of Operations Research, 30(1):45–62, 1991.

[26] Dilip Mookherjee and Barry Sopher. Learning and decision costs in experimental constant sum games. Games and Economic Behavior, 19(1):97–132, 1997.

[27] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

[28] Gordon M. Becker, Morris H. DeGroot, and Jacob Marschak. Measuring utility by a single-response sequential method. Behavioral Science, 9(3):226–232, 1964.

[29] Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, Cambridge, Massachusetts, 1991.

[30] Bahador Bahrami, Karsten Olsen, Peter E. Latham, Andreas Roepstorff, Geraint Rees, and Chris D. Frith. Optimally interacting minds. Science, 329(5995):1081–1085, 2010.

[31] Chris L. Baker, Julian Jara-Ettinger, Rebecca Saxe, and Joshua B. Tenenbaum. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4):0064, 2017.

[32] Michael Shum, Max Kleiman-Weiner, Michael L. Littman, and Joshua B. Tenenbaum. Theory of minds: Understanding behavior in groups through inverse planning. CoRR, abs/1901.06085, 2019.