{"title": "On Testing for Biases in Peer Review", "book": "Advances in Neural Information Processing Systems", "page_first": 5286, "page_last": 5296, "abstract": "We consider the issue of biases in scholarly research, specifically, in peer review. There is a long standing debate on whether exposing author identities to reviewers induces biases against certain groups, and our focus is on designing tests to detect the presence of such biases. Our starting point is a remarkable recent work by Tomkins, Zhang and Heavlin which conducted a controlled, large-scale experiment to investigate existence of biases in the peer reviewing of the WSDM conference. We present two sets of results in this paper. The first set of results is negative, and pertains to the statistical tests and the experimental setup used in the work of Tomkins et al. We show that the test employed therein does not guarantee control over false alarm probability and under correlations between relevant variables, coupled with any of the following conditions, with high probability can declare a presence of bias when it is in fact absent: (a) measurement error, (b) model mismatch, (c) reviewer calibration. Moreover, we show that the setup of their experiment may itself inflate false alarm probability if (d) bidding is performed in non-blind manner or (e) popular reviewer assignment procedure is employed.  Our second set of results is positive, in that we present a general framework for testing for biases in (single vs. double blind) peer review. We then present a hypothesis test with guaranteed control over false alarm probability and non-trivial power even under conditions (a)--(c). Conditions (d) and (e) are more fundamental problems that are tied to the experimental setup and not necessarily related to the test.", "full_text": "On Testing for Biases in Peer Review\n\nIvan Stelmakh, Nihar B. 
Shah and Aarti Singh\n\nSchool of Computer Science\nCarnegie Mellon University\n\n{stiv,nihars,aarti}@cs.cmu.edu\n\nAbstract\n\nWe consider the issue of biases in scholarly research, speci\ufb01cally, in peer review.\nThere is a long standing debate on whether exposing author identities to reviewers\ninduces biases against certain groups, and our focus is on designing tests to detect\nthe presence of such biases. Our starting point is a remarkable recent work by\nTomkins, Zhang and Heavlin which conducted a controlled, large-scale experiment\nto investigate existence of biases in the peer reviewing of the WSDM conference.\nWe present two sets of results in this paper. The \ufb01rst set of results is negative,\nand pertains to the statistical tests and the experimental setup used in the work of\nTomkins et al. We show that the test employed therein does not guarantee control\nover false alarm probability and under correlations between relevant variables,\ncoupled with any of the following conditions, with high probability can declare\na presence of bias when it is in fact absent: (a) measurement error, (b) model\nmismatch, (c) reviewer calibration. Moreover, we show that the setup of their\nexperiment may itself in\ufb02ate false alarm probability if (d) bidding is performed in\nnon-blind manner or (e) popular reviewer assignment procedure is employed. Our\nsecond set of results is positive, in that we present a general framework for testing\nfor biases in (single vs. double blind) peer review. We then present a hypothesis\ntest with guaranteed control over false alarm probability and non-trivial power even\nunder conditions (a)\u2013(c). 
Conditions (d) and (e) are more fundamental problems\nthat are tied to the experimental setup and not necessarily related to the test.\n\n1 Introduction\nPast research in social sciences indicates that humans display various biases including gender, race and\nage biases in many critical domains such as hiring [4], university admission [32], bail decisions [2]\nand many others. Our focus is on fairness in academia and scholarly research, and specifically, on\nbiases in peer review. Peer review is a backbone of scholarly research and is employed by a vast\nmajority of journals and conferences. Due to the widespread prevalence of the Matthew effect \u2013\nrich get richer and poor get poorer \u2013 in academia [31, 27], any biases in peer review can have far-reaching\nconsequences on career trajectories of researchers. Specifically, we follow the long-standing\ndebate [6, 25, 26, 1, 23, 8, 35, 14, and references therein] on whether the authors\u2019 identities should\nbe hidden from reviewers or not. The focus of this paper is on designing statistical tests to detect the\npresence of biases in peer review.\nIn a recent remarkable piece of work, Tomkins et al. [33] conducted a large scale (semi-) randomized\ncontrolled trial during the peer review for the ACM International Conference on Web Search and Data\nMining (WSDM) 2017. In their experiment, the entire pool of reviewers was partitioned uniformly\nat random into two equal groups \u2013 single blind and double blind \u2013 and each paper was assigned\nto two reviewers from each of the groups. In this manner, the peer-review data contained both\nsingle-blind and double-blind reviews for each paper. 
The experiment allowed them to conduct a\ncausal inference to test for biases, and conclude that the single-blind system induces a bias in favor of\npapers authored by (i) researchers from top universities, (ii) researchers from top companies and (iii)\nfamous authors. Interestingly, no bias against female-authored submissions was detected by their\ntest, though a meta-analysis confirmed the presence of such bias.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n[Figure 1: three panels plotting, against the number of papers (200 to 1000), the false alarm probability in panel (a) and the probability of detection in panels (b) and (c), for two tests: Previous work and Disagreement test.]\n\n(a) Bias is absent. Any valid test must have false alarm probability below 0.05.\n\n(b) Bias is present. Higher probability of detection is better.\n\n(c) Bias is present. Higher probability of detection is better.\n\nFigure 1: Synthetic simulations evaluating performance of the test in Tomkins et al. [33] (\u201cprevious\nwork\u201d) and the test proposed in this paper (\u201cDISAGREEMENT test\u201d). Subfigures (a) and (b) are in\npresence of correlations and noisy estimates of true scores by double-blind reviewers; subfigure (c)\nhas zero correlations and perfect estimate of true scores by double-blind reviewers. Details of the\nsimulation setup are provided in Section 3. The error bars are too small to be visible.\n\n
The conclusions of this experiment\nhave had a signi\ufb01cant impact. For instance, the WSDM conference itself completely switched to\ndouble-blind peer review starting 2018.\nTesting for the presence of hypothesized phenomena is a common task in various branches of science\nincluding the biological, social, and physical sciences. The general approach therein is to impose a\nhard constraint on the probability of false alarm (claiming existence of the phenomenon when there\nis none; also called Type-I error) to some prede\ufb01ned threshold called signi\ufb01cance level typically set\nas 0.05 or 0.01. The test would then aim to maximize the probability of detecting the phenomenon\nwhen it is actually present, while not violating the aforementioned hard constraint. The present paper\nalso follows this general approach, for the speci\ufb01c setting of testing for biases using single versus\ndouble blind reviewing.\n\nContributions. In this paper, we study the problem of detecting bias in peer review (in the setting as\nconsidered in Tomkins et al. [33]). In this context, we present two sets of results.\nNegative results (Section 3) We \ufb01rst analyze the testing procedure used by Tomkins et al., and show\nthat under plausible conditions the statistical test employed therein does not control for false alarm\nprobability. In other words, we show that under reasonable conditions, the test used by Tomkins et\nal. [33] can, with probability as large as 0.5 or higher, declare the presence of a bias when the bias\nis in fact absent (even when the test is tuned to have a false alarm error rate below 0.05). 
Specifically,\nwe show that in the presence of correlations that are reasonable to expect, any of the following factors\nbreaks their false alarm probability guarantees: (a) measurement error caused by noise or subjectivity\nof reviewers, (b) model mismatch caused by violation of strong parametric assumptions on reviewers\u2019\nbehavior and (c) reviewer\u2019s calibration if she/he reviews more than one paper. Figures 1a and 1b\nillustrate the effect of measurement error on the false alarm probability and probability of detection\nof the test used by Tomkins et al. The issues we identify suggest that their test is at risk of committing\nType-I error in declaring biases in their analysis.\nMoving beyond the specific test used in Tomkins et al. [33], we also study the effect of their\nexperimental design, which is simply the standard peer-review procedure with an additional random\npartition of reviewers into single and double blind groups. We show that two factors \u2013 (d) asymmetrical\nbidding procedure and (e) non-random assignment of papers to referees \u2013 as is common in peer-review\nprocedures today may introduce spurious correlations in the data, breaking some key independence\nassumptions and thereby violating the requisite guarantees on testing.\nPositive results (Sections 4 and 5) We propose a general framework for the design of statistical tests to\ndetect biases in this problem setting, that overcomes the aforementioned limitations. Specifically, our\nframework does not assume objectivity of reviewers and does not make any parametric assumptions\non reviewers\u2019 behaviour. Conceptually, we propose to think of this problem as an instance of a\ntwo-sample testing problem where single-blind and double-blind reviews form two samples and the\ntest operates on these samples. (In contrast, Tomkins et al. 
[33] study the problem under one-sample\ntesting paradigm, operating on reviews of single-blind reviewers and using double-blind reviews to\nestimate some parameters in their parametric model).\nWe then design a computationally-ef\ufb01cient hypothesis test with a provable control over the false alarm\nprobability under various conditions, including aforementioned conditions (a) - (c). Our test also has\nnon-trivial power in that it has considerably higher probability of detection in hard cases where the\ntest used by Tomkins et al. fails, and also has not too much loss in power when the assumptions made\nin Tomkins et al. [33] are exactly met, and there is no correlation or noise. The performance of this\ntest is illustrated in Figure 1.\n\nIt is important to note that in this work, we do not aim to prove or disprove the existence of biases\ndeclared in the experiment by Tomkins et al. [33]. Instead, our focus is on the theoretical validity\nof the statistical procedures used to conduct such experiments and more generally on principled\nstatistical approach towards designing such experiments. Finally, we note that the results and tests we\ndiscuss in this work are also applicable beyond peer review, and can be used to test for biases in other\ndomains such as admissions and hiring.\nRelated work. The problem of identifying biases in human decisions is commonly studied in social\nscience and there are many works that design and conduct randomized \ufb01eld experiments in various\nsettings, including resume screening [5], hiring in academia [22], and peer review [6, 23]. However,\nthe conference peer review setup we consider in this work does not comprise a fully randomized\ncontrol trial (i.e., the reviewers are not assigned to submissions at random) and past approaches fail\ndue to idiosyncrasies of the peer-review process. 
For example, a popular approach [5, 22] is to assign\nauthor identities to (fabricated) documents (resumes, application packages or papers) uniformly\nat random and compare the outcomes for different categories of authors. In our setup, random\nassignment of author identities to real (i.e., non-fabricated) submissions is problematic due to various\nlogistical and ethical issues such as reviewers guessing actual authors thereby causing biases, and\nrequirements of getting authors to agree to have their paper/name modified. Another approach [23]\nis to submit the same paper to multiple reviewers in both single-blind and double-blind conditions\nand test for the difference in the acceptance rates between conditions. However, such an approach\nnecessitates a considerable additional reviewing load. Other approaches include observational studies,\nand we refer the interested readers to [33] for a more in-depth literature review.\nThis paper also falls in the line of several recent works in computer science on the peer-review process,\nwhich include both empirical [15, 13] and theoretical [30, 34, 17] studies.\n\n2 Preliminaries\nThe general peer-review setup we study for testing biases using single and double blind review is\nas considered in Tomkins et al. [33]. We study a conference peer-review setup where n papers are\nsubmitted at once and m independent reviewers are available to review submissions. With a goal\nto test whether single-blind reviewing induces a bias against or in favor of some groups of authors,\nwe consider some pre-defined set of k binary mutually non-exclusive properties pertaining to the\nauthor(s) of any paper to be tested for bias. For example, a property could be \u201cthe first author is\nfemale\u201d or \u201cmajority of authors are from the USA\u201d. Each paper $j \\in [n]$ is then associated with k\nindicator variables $w_j^{(1)}, \\ldots, w_j^{(k)}$, where $w_j^{(\\ell)} = 1$ if paper $j$ satisfies property $\\ell$ and $w_j^{(\\ell)} = -1$\notherwise. For each $\\ell \\in [k]$ we let $\\mathcal{J}_\\ell$ denote the set of papers that satisfy property $\\ell$ and $\\overline{\\mathcal{J}}_\\ell = [n] \\setminus \\mathcal{J}_\\ell$\ndenote its complement.1\nThe peer review process is conducted as follows. Each reviewer is uniformly at random allocated\nto one of the two conditions: (i) Double-Blind condition (DB) in which reviewers do not observe\nidentities of papers\u2019 authors; and (ii) Single-Blind condition (SB) in which reviewers observe identities\nof papers\u2019 authors. Next, each paper is assigned to $\\lambda$ reviewers from the SB group and $\\lambda$ reviewers\nfrom the DB group such that each reviewer reviews at most $\\mu$ submissions, where $\\lambda$ and $\\mu$ are\npredefined constants. In both conditions, if any reviewer $i \\in [m]$ is assigned to any paper $j \\in [n]$,\nthen she/he returns a binary accept/reject recommendation and possibly a numeric score that estimates\nthe quality of the paper as perceived by the reviewer, accompanied by a textual review.\n\n1Here, we adopt the standard notation $[\\nu] = \\{1, 2, \\ldots, \\nu\\}$ for any positive integer $\\nu$.\n\nFor each property $\\ell \\in [k]$, we are interested in whether single-blind peer review setup induces a bias\nagainst or in favor of papers that satisfy this property. For example, if we consider property \u201cthe first\nauthor is female\u201d, then we aim at testing for the bias against or in favor of papers with female first\nauthor. Note that with respect to the properties, the study is observational in that we cannot assign\nauthor identities to papers at random. Hence, the effect of confounding is unavoidable and utmost\ncare must be taken to address presence of confounding factors.\nFor brevity, in this paper we consider the case of a single property of interest (k = 1) which captures\nthe complexity of our problem. For ease of notation, we drop index $\\ell$ from $w^{(\\ell)}$ and $\\mathcal{J}_\\ell$. 
For the\ndiscussion of the general case of multiple properties of interest (k > 1) we refer the reader to the\nextended version of this paper [29].\nLet us now give details of the testing procedure used by Tomkins et al. [33].\nModel and test used by Tomkins et al. We begin by introducing an idealized version of their model.\nThey assume a parametric, logistic model for the binary decisions made by SB reviewers. Specifically,\nfor each paper $j \\in [n]$, let $Y_{1j}, \\ldots, Y_{\\lambda j}$ denote the binary accept/reject decisions given by the $\\lambda$\nreviewers assigned to paper $j$ in the SB setup. It is assumed that $\\{Y_{rj}\\}_{r \\in [\\lambda]}$ are independent draws\nfrom a Bernoulli random variable with an expectation $\\pi_j$ satisfying\n\n$$\\log \\frac{\\pi_j}{1 - \\pi_j} = \\beta_0 + \\beta_1 q^*_j + \\beta_2 w_j, \\tag{1}$$\n\nwhere $q^*_j$ is a \u201ctrue\u201d underlying score of paper $j$, $w_j$ is an indicator of property satisfaction and\n$\\{\\beta_0, \\beta_1, \\beta_2\\}$ are unknown coefficients. In words, the model says that if there is a positive (negative)\nbias with respect to a property of interest, then the fact that a paper satisfies the property increases\n(decreases) the log-odds of the probability of recommending acceptance by $2\\beta_2$ as compared to the\ncase if the same paper does not satisfy the property. The main difficulty with this model in the peer\nreview setting lies in the fact that true scores $\\{q^*_j, j \\in [n]\\}$ are unknown and hence standard tests for\nlogistic regression model are not readily applicable.\nIn order to overcome the unavailability of true scores $\\{q^*_j, j \\in [n]\\}$ in the model (1), Tomkins et\nal. [33] use a plug-in estimate: they replace $q^*_j$ with the mean $\\tilde{q}_j$ of scores given by the DB reviewers\nto paper $j$, for every $j \\in [n]$. Under this approximation and using $\\tilde{q}_1, \\ldots, \\tilde{q}_n$, they obtain maximum\nlikelihood estimates of coefficients $\\{\\hat{\\beta}_0, \\hat{\\beta}_1, \\hat{\\beta}_2\\}$ and then use the standard Wald test [36] to test for\nsignificance of the coefficient $\\beta_2$. A bias is declared present if the coefficient $\\beta_2$ is found significant;\nthe direction of the bias is determined as the sign of $\\hat{\\beta}_2$.\n\n3 Negative results\nIn this section we identify several issues that should be taken into account when testing for biases in\nthe setup we consider. Noting that the issues themselves are general, we motivate and discuss them in\ncontext of the prior work by Tomkins et al. [33] and investigate possible consequences of these issues\nthrough synthetic simulations.\nRecall that with respect to the property of interest the experiment is observational and hence we\ncannot assume that variable $w$ that encodes property satisfaction is independent of true score $q^*$. For\nexample, consider a property \u201cpaper has author from top university\u201d. For this property a non-trivial\ncorrelation between true scores and indicators of property satisfaction is natural to expect. While\ncorrelation itself does not cause issues, we identify five conditions that coupled with correlation can\nbe significantly harmful.\nIn the simulations to follow, we juxtapose the algorithm by Tomkins et al. [33] to our DISAGREEMENT\ntest introduced later in the paper. Complete details of all simulations are given in Appendix A.\n(a) Measurement error. Tomkins et al. [33] report low interreviewer agreement between DB\nreviewers which means that the estimates $\\tilde{q}_1, \\ldots, \\tilde{q}_n$ of the true scores by the DB reviewers are\nnoisy. It is known [28, 7] that noisy covariate measurement coupled with correlation between some\ncovariates may inflate Type-I error rate of the Wald test for logistic regression. We now investigate the\nimpact of measurement error on the Type-I error rate of the Tomkins et al. 
test through simulations.\n\n[Figure 2: five panels plotting the Type-I error rate of two tests, Previous work and Disagreement test. Panels (a) and (b): against the correlation coefficient (0.0 to 0.4). Panel (c): against reviewer load (1 to 100). Panel (d): by bidding in SB condition (Blind, Non-blind). Panel (e): by assignment type (TPMS, Random).]\n\n(a) Measurement error\n\n(b) Model mismatch\n\n(c) Reviewers\u2019 input\n\n(d) Non-blind bidding\n\n(e) Non-random assignment\n\nFigure 2: Type-I error of the test from previous work (Tomkins et al. [33]) blows up under five\ndifferent setups: bias is absent in all simulations and the tests are designed to limit the Type-I error to\nat most 0.05. In contrast, our DISAGREEMENT test is robust to violations of modelling assumptions\n(a)-(c). Note that non-blind bidding by SB reviewers and non-randomness of the assignment (left\nbars in subfigures d and e), which pertain to the experimental setup rather than the modelling, break\nguarantees of both tests. Error bars are too small to be visible.\n\nWe consider absence of any bias, and assume that model (1) with $\\beta_2 = 0$ is correct for both DB and\nSB reviewers. 
We consider DB reviewers to report noisy estimates of true scores $q^*_j$, and vary the\ncorrelation between $q^*$ and $w$. We plot the Type-I error rates in Figure 2a for the test in Tomkins et\nal. [33] and our proposed test; both tests are designed to restrict the Type-I error rate to 0.05. Given\nthat interreviewer agreement in the actual experiment of Tomkins et al. [33] was low (level of noise\nis high), the fact that some properties they consider may lead to correlations between $q^*$ and $w$ is\nconcerning, because it could potentially undermine the validity of their findings.\nThe simulations in Section 1 follow the setup presented here: Figures 1a and 1b consider measurement\nerror with correlation fixed at 0.4 and 0.6 respectively. Notice that the issues are exacerbated as the sample size\ngrows; Figure 1c has zero correlation and no measurement errors.\n(b) Model mismatch. Model (1) assumes a specific parametric relationship, which is unlikely to\nhold in practice. In order to check the effect of model mismatches, we consider a violation of the\nmodel (1) and suppose that the correct model is $\\log \\frac{\\pi_j}{1 - \\pi_j} = \\beta_0 + \\beta_1 (q^*_j)^3 + \\beta_2 w_j$. We consider\nan absence of any bias, that is, set $\\beta_2 = 0$ for both SB and DB reviewers. We perform simulations\nsimilar to those in item (a) with the exception that true scores $q^*_j$, $j \\in [n]$, are known exactly to the\ntest of Tomkins et al. Figure 2b shows results of simulations.\n(c) Reviewer calibration. Model (1) assumes that reviews given by the same reviewer are independent.\nIn practice this assumption may be violated due to correlations introduced by reviewer\u2019s\ncalibration [34]. While some easy calibrations such as harshness/leniency can be captured by simple\nparametric extensions of model (1), more subtle patterns are beyond the scope of this model. 
Suppose\nfor example that the strength of reviewers\u2019 input depends on paper\u2019s clarity \u2013 the better the paper is\nwritten, the lower the contribution due to reviewers\u2019 calibration. Assume also that we are given\na set of papers such that true score of each paper is proportional to the clarity of the paper (we\nformalize construction in Appendix A.1.3). Coupled with the correlation between $q^*$ and $w$, this\npattern is sufficient to break Type-I error guarantees of the Tomkins et al. test. Figure 2c shows a result\nof simulations in which we vary the number of papers per reviewer, keeping correlation between $q^*$\nand $w$ fixed at 0.75 and the total number of papers fixed at n = 1000. We simulate a wide range\nof reviewer load $\\mu$ including small to medium loads of 5-15 papers typical in machine learning\nconferences like NeurIPS and larger loads of 40 or higher found in other smaller conferences.\n(d) Non-blind bidding. The issues discussed above pertain to the testing procedure and modelling\nassumptions made by Tomkins et al. [33], and can be avoided by designing a principled statistical\napproach to the testing problem as we do in Sections 4 and 5. We now issue a commentary regarding\nthe experimental setup considered in their work and show that the setup itself may create problems in\ncontrolling the Type-I error. In the experiment of Tomkins et al., papers are allocated to reviewers\nbased on \u201cbids\u201d representing their preferences. Importantly, the reviewers in the SB setup also get to\nsee author identities in the bidding stage, which may act as a confounding factor in tests for bias in\nthe acceptance/rejection of papers. 
Indeed, authorship information available to SB reviewers may\nintroduce a difference in bidding behaviour between conditions and this difference may result in\nstructurally different evaluations even when reviewers are unbiased, leading to a blow-up of the\nType-I error rate of any reasonable test as illustrated by Figure 2d (formal setup is in Appendix A.1.4).\nThis issue is indeed pointed out as a caveat by Tomkins et al. in their paper.\n(e) Reviewer assignment. One might imagine that a natural solution to the aforementioned problem would be\nto conduct the bidding in a double blind fashion for both groups. However, perhaps surprisingly,\nwe show that even if both groups bid in a double blind fashion (or even if the bidding process is\neliminated entirely), and even if the reviewers are assigned to DB or SB groups uniformly at random,\nthe non-random assignment using algorithms such as TPMS [9] can still swell the Type-I error rate.\nFigure 2e contrasts Type-I errors under random assignment of reviewers to papers, and an assignment\ncomputed by the TPMS algorithm [9] operating on a similarity matrix constructed in Appendix A.1.5,\nwhere reviewer decisions are correlated with the similarity (e.g., reviewers being more lenient on\npapers closer to their own area). Notably, even the DISAGREEMENT test which is robust to various\nissues discussed above is unable to control the Type-I error under the TPMS assignment. Due to\nmeasurement errors that arise in our construction, the test of Tomkins et al. [33] fails even under\nrandom assignment, and the non-randomness exacerbates the effect.\n\n4 Novel framework to test for biases\n\nIn Section 3 we identified five key limitations of the approach taken by Tomkins et al. [33]. 
Three of\nthese limitations pertain to the testing procedure and the other two limitations regarding the bidding\nand assignment procedures relate to the design of the experiment itself. In the next two sections we aim at\nimproving the test used by Tomkins et al. [33]. Hence, in the theoretical arguments below we assume\nthat reviewers\u2019 evaluations are independent of bids and similarities (alternatively, the assignment\nof papers to referees is selected uniformly at random from the set of all feasible assignments). In\nthe extended version of this paper [29] we relax the random assignment assumption by introducing\na novel experimental procedure that allows the use of any assignment algorithm without running into\nissues (d) and (e) discussed above.\nOne approach to address the aforementioned issues with the testing procedure is to design methods\nfor logistic regression model that are robust to various factors such as noise, misspecification,\netc. [20, 18, 28, 24, 7]. However, in this work we consider the problem more generally, because the\nlogistic model (1) itself could be highly inaccurate. Specifically, we aim at designing a test that does\nnot rely on strong modelling assumptions and also holds when reviewers\u2019 decisions are subjective.\nAt a high level, our approach to testing for biases is different from that proposed by Tomkins\net al. [33] in two ways. First, we relax two strict modelling assumptions: (i) instead of assuming\nexistence of true qualities of submissions, we allow subjectivity in reviewer evaluations [16, 11, 3, 21,\n19], and (ii) we do not assume any specific form of the relationship between a paper and its probability\nof acceptance by a reviewer. Instead, we allow these probabilities to be completely arbitrary and\ndefine bias in terms of these probabilities. Second, we treat this problem conceptually differently\nfrom the work of Tomkins et al. [33]. 
The test therein treats the problem as that of one-sample testing\nand uses DB scores as a plugin estimate of true scores in SB model. In contrast, we approach this\nproblem through the lens of two-sample testing, where SB and DB reviews form the two samples,\nand the goal is to test whether they belong to the same distribution. This perspective helps us to avoid\na number of issues discussed in Section 3.\nFormally, let $\\Pi^{\\mathrm{db}} \\in [0, 1]^{m \\times n}$ be a matrix whose $(i, j)$th entry, denoted as $\\pi^{(\\mathrm{db})}_{ij}$, represents a\nprobability that reviewer $i$ would recommend acceptance of paper $j$ if that paper is assigned to that\nreviewer in DB setup. Similarly, let matrix $\\Pi^{\\mathrm{sb}} \\in [0, 1]^{m \\times n}$ be an analogous matrix in SB setup, and\ndenote its $(i, j)$th entry as $\\pi^{(\\mathrm{sb})}_{ij}$.\nLet $\\mathcal{R}_{\\mathrm{SB}}$ be the set of reviewers allocated to the SB condition. Moreover, for each $i \\in \\mathcal{R}_{\\mathrm{SB}}$, let $\\mathcal{P}_{\\mathrm{SB}}(i)$\ndenote the set of papers assigned to reviewer $i$ and let $Y_{ij} \\in \\{0, 1\\}$ denote the accept/reject decision\ngiven by reviewer $i$ for paper $j \\in \\mathcal{P}_{\\mathrm{SB}}(i)$. We similarly define the set of DB reviewers $\\mathcal{R}_{\\mathrm{DB}}$ and\ntheir decisions $\\{X_{ij} : i \\in \\mathcal{R}_{\\mathrm{DB}}, j \\in \\mathcal{P}_{\\mathrm{DB}}(i)\\}$. We are interested in testing for biases with respect to\na property of interest. Recalling our notation $\\mathcal{J} \\subseteq [n]$ for the set of papers that satisfy a property of\ninterest, and $\\overline{\\mathcal{J}}$ as its complement, we now define the bias testing problem.\nProblem 1 (Bias testing problem). 
Given significance level $\\alpha \\in (0, 1)$, and decisions of SB and DB\nreviewers, the goal is to test the following hypotheses:\n$$H_0:\\ \\forall i \\in [m]\\ \\forall j \\in [n]\\quad \\pi^{(\\mathrm{sb})}_{ij} = \\pi^{(\\mathrm{db})}_{ij}$$\n$$H_1:\\ \\forall i \\in [m]\\ \\forall j \\in [n]\\quad \\begin{cases} \\pi^{(\\mathrm{sb})}_{ij} \\ge \\pi^{(\\mathrm{db})}_{ij} & \\text{if } j \\in \\mathcal{J} \\\\ \\pi^{(\\mathrm{sb})}_{ij} \\le \\pi^{(\\mathrm{db})}_{ij} & \\text{if } j \\in \\overline{\\mathcal{J}}, \\end{cases} \\tag{2}$$\nwhere at least one inequality in the alternative hypothesis (2) is strict.2\nFootnote 2: An equivalent definition of the problem from the perspective of causal inference can be found in Appendix B.\nIn words, under the null hypothesis the knowledge of authors\u2019 identities does not induce any difference\nin reviewers\u2019 behaviour. On the other hand, under the alternative there is a bias in favor of papers that\nsatisfy the property of interest. Note that one can define an alternative that represents a bias against\npapers from $\\mathcal{J}$ simply by exchanging the sets $\\mathcal{J}$ and $\\overline{\\mathcal{J}}$ in (2).\nOur goal is to design a testing procedure that both controls the Type-I error and has non-trivial\npower for any pair of matrices $\\Pi^{\\mathrm{sb}}$, $\\Pi^{\\mathrm{db}}$ that fall under the definition of Problem 1.\nNon-trivial power. Informally, we say that the test has non-trivial power if for choices of $\\Pi^{\\mathrm{sb}}$ and\n$\\Pi^{\\mathrm{db}}$ for which the presence of bias is \u201cobvious\u201d, the test is able to detect the bias with probability\nthat goes to 1 as the number of papers in both $\\mathcal{J}$ and $\\overline{\\mathcal{J}}$ grows to infinity. Formally, we say that matrices\n$\\Pi^{\\mathrm{sb}}$ and $\\Pi^{\\mathrm{db}}$ satisfy the alternative hypothesis (2) with margin $\\delta$, if all inequalities in equation (2) are\nsatisfied with margin $\\delta > 0$, that is, $|\\pi^{(\\mathrm{sb})}_{ij} - \\pi^{(\\mathrm{db})}_{ij}| > \\delta$ $\\forall (i, j) \\in [m] \\times [n]$. Then we say that the\ntesting procedure has non-trivial power if for any $\\varepsilon > 0$ and for any $\\delta > 0$ there exists $n_0 = n_0(\\varepsilon, \\delta)$\nsuch that if $\\min\\{|\\mathcal{J}|, |\\overline{\\mathcal{J}}|\\} > n_0$, then for any $\\Pi^{\\mathrm{sb}}$ and $\\Pi^{\\mathrm{db}}$ that satisfy alternative hypothesis (2) with\nmargin $\\delta$, the power of the testing procedure is at least $1 - \\varepsilon$.\nFor instance, if the logistic model (1) is correct for both SB and DB reviewers with $\\beta^{(\\mathrm{sb})}_0 = \\beta^{(\\mathrm{db})}_0 = 0$,\n$\\beta^{(\\mathrm{sb})}_1 = \\beta^{(\\mathrm{db})}_1 = 1$, $\\beta^{(\\mathrm{db})}_2 = 0$ and $|\\beta^{(\\mathrm{sb})}_2| > 0$, then the requirement of non-trivial power ensures that\nfor any choice of true scores bounded in absolute value by a universal constant and any choice of\nproperty satisfaction indicators, the test has power growing to 1 as $\\min\\{|\\mathcal{J}|, |\\overline{\\mathcal{J}}|\\}$ goes to infinity.\n5 Positive results\nIn this section we present a test for Problem 1 and discuss some generalizations.\n5.1 Disagreement-based test\nLet us now introduce a statistical test for the bias testing problem (Problem 1) and show that it satisfies\nthe requirements of control over Type-I error and non-trivial power. The test is built on two key ideas.\n\u2022 First, consider a pair of SB and DB reviewers who disagree in their decisions for some paper.\nUnder the null hypothesis, the events \u201cSB accepts and DB rejects\u201d and \u201cSB rejects and DB accepts\u201d\nare equiprobable. In contrast, under the alternative, the SB reviewer is more (or less) likely to vote for\nacceptance than her/his DB counterpart, depending on the value of $w_j$ and the direction of the bias.\n\u2022 Second, in order to avoid correlations introduced by reviews given by the same reviewer, the test\nuses at most one decision per reviewer. It does so by first matching reviewers into pairs, consisting\nof one SB and one DB reviewer who review a common paper, and maximizing the number of such\npairs subject to a constraint that each reviewer appears in at most one pair.\n\nFor a moment, assume that we are given a set of tuples $\\mathcal{T}$, where each tuple $t \\in \\mathcal{T}$ consists of a paper\n$j_t \\in [n]$, the decision of an SB reviewer for this paper $Y_{j_t}$, the decision of a DB reviewer for this paper $X_{j_t}$\nand the indicator of property satisfaction $w_{j_t}$, with a constraint that each reviewer contributes her/his\ndecision to at most one tuple. In this setting, we formally describe our DISAGREEMENT test as Test 1.\nTest 1 DISAGREEMENT\nInput: Significance level $\\alpha \\in (0, 1)$.\nSet of tuples $\\mathcal{T}$, where each $t \\in \\mathcal{T}$ is of the form $(j_t, Y_{j_t}, X_{j_t}, w_{j_t})$ for some paper $j_t \\in [n]$.\n1. Initialize $U$ and $V$ to be empty arrays.\n2. For each tuple $t \\in \\mathcal{T}$, if $Y_{j_t} \\ne X_{j_t}$, append $Y_{j_t}$ to $U$ if $w_{j_t} = 1$, or to $V$ if $w_{j_t} = -1$.\n3. Run permutation test [12] at the level $\\alpha$ to test if entries of $U$ and $V$ are exchangeable random\nvariables, using the test statistic:\n$$\\tau = \\frac{1}{|U|} \\sum_{r \\in [|U|]} U_r - \\frac{1}{|V|} \\sum_{r \\in [|V|]} V_r.$$\n4. Reject the null if and only if the permutation test rejects the null. (If either of the arrays $U$ and $V$ is\nempty, the test keeps the null.)\n\nWe now discuss construction of the set $\\mathcal{T}$ for input to Test 1 from the given set of reviews. The goal\nof the construction is to ensure that $\\mathcal{T}$ contains \u201cenough\u201d tuples that correspond to papers from $\\mathcal{J}$\nand $\\overline{\\mathcal{J}}$. 
We consider two cases.\n\u2022 If $\\lambda \\ge \\mu$, then using the Hungarian matching algorithm each paper is matched to 1 SB reviewer\nand 1 DB reviewer, in a manner that each reviewer is matched to at most one paper.\n\u2022 If $\\lambda < \\mu$, then we use an iterative algorithm which greedily matches one paper from $\\mathcal{J}$ and one\npaper from $\\overline{\\mathcal{J}}$ to 1 SB and 1 DB reviewer in each iteration, with a constraint that each reviewer is\nmatched to at most one paper. This algorithm is guaranteed to match a constant fraction of papers\nfrom both $\\mathcal{J}$ and $\\overline{\\mathcal{J}}$.\n\nThe matching algorithms for both cases are formally presented in Appendix C due to lack of space.\nThe following theorem now presents guarantees for our test.\nTheorem 1. For any significance level $\\alpha \\in (0, 1)$, under the setup of the bias testing problem (Problem 1),\nthe DISAGREEMENT test coupled with matching algorithms from Appendix C is guaranteed\nto control the Type-I error at the level $\\alpha$, and also satisfies the requirement of non-trivial power.\nRemark. If the logistic model (1) is correct, then Theorem 1 ensures that the DISAGREEMENT test\nprovably controls the Type-I error and can detect a bias with probability that goes to 1 as sample size\ngrows, without requiring knowledge (neither exact nor approximate) of true scores $q^*_1, \\ldots, q^*_n$.\nWe now revisit the issues (a)-(c) from Section 3 in the context of our DISAGREEMENT test.\n\u2022 Measurement error. Our test does not rely on any estimation of papers\u2019 qualities made by\nreviewers. Moreover, we do not even assume that there exists some objective quantity that can be\nestimated. Hence, our test does not suffer from issues caused by noisy estimates of scores given by\nDB reviewers as illustrated by Figure 2a.\n\n\u2022 Model mismatch. The only assumption we make is that under the null hypothesis there is no\ndifference in behavior of SB and DB reviewers. 
Hence, Proposition 1 guarantees that our test is\nrobust to violations of the specific parametric model (1) as illustrated by Figure 2b.\n\n\u2022 Reviewer calibration. We circumvent the detrimental effect of spurious correlations introduced\nby reviewers\u2019 calibration through a matching procedure that ensures that each reviewer contributes\nat most one review to the test. See Figure 2c for an illustration. Of course, such robustness comes\nat the cost of some power, but we note that our matching procedures guarantee use of at least a\nconstant fraction of available data, thereby limiting the reduction in the power.\n\nEffect size. The test statistic $\\tau$ of the DISAGREEMENT test gives an estimate of the effect size.\nSlightly informally, $\\tau$ measures the difference in acceptance rates of \u201cborderline\u201d papers from $\\mathcal{J}$ and\n$\\overline{\\mathcal{J}}$ in the SB setup. Indeed, by conditioning on pairs of disagreeing reviewers in Step 2 of Test 1,\nthe test rules out \u201cclear accept\u201d and \u201cclear reject\u201d papers thus considering only the papers for which\nreviewers disagree (i.e. borderline papers). The absolute value of the test statistic then is a reasonable\nestimate of the effect size and is in a similar vein to Cohen\u2019s d [10] and other popular measures.\n5.2 Generalization\nWe now consider a generalization of Problem 1 which accommodates an additional confounding factor:\na bias in the reviewer simply due to her/his assignment in the SB or the DB group (and independent\nof the paper or its characteristics). For example, reviewers may not have any bias with respect to\nthe property of interest, but just being placed in the SB condition may induce more harsh opinions\nfrom the reviewers in DB. Formally, recall the null hypothesis $\\pi^{(\\mathrm{sb})}_{ij} = \\pi^{(\\mathrm{db})}_{ij}$ $\\forall (i, j) \\in [m] \\times [n]$ in\nProblem 1. Instead, under the null, we now allow $\\pi^{(\\mathrm{sb})}_{ij} = f_0(\\pi^{(\\mathrm{db})}_{ij})$, for some monotonic function\n$f_0 : [0, 1] \\to [0, 1]$. As in Problem 1, the bias is then defined as a violation of the null hypothesis\nwhere direction of the violation is determined by the indicator $w_j$.\nOf course, one may not know the function $f_0$ and the goal of this general problem is to design\na test that is guaranteed to control the Type-I error and has non-trivial power uniformly for\nall functions $f_0$ that belong to some set of functions $\\mathcal{F}^*$. The definition of non-trivial power from\nSection 4 transfers to this problem with the exception that all $\\pi^{(\\mathrm{db})}_{ij}$ are substituted by $f_0(\\pi^{(\\mathrm{db})}_{ij})$ for\n$f_0 \\in \\mathcal{F}^*$. In the extended version of this paper [29] we present a negative result showing that this\ngoal is impossible to achieve for general $\\mathcal{F}^*$. However, in what follows we show that one can achieve\nthis goal under some specific choices of class $\\mathcal{F}^*$, including a generalization of logistic model (1).\nGeneralized logistic model. For every paper $j \\in [n]$, let $q_j \\in \\mathbb{R}$ be some unknown representation of\npaper $j$. The generalized logistic model assumes that for every $(i, j) \\in [m] \\times [n]$ we have\n$$\\text{SB: } \\log \\frac{\\pi^{(\\mathrm{sb})}_{ij}}{1 - \\pi^{(\\mathrm{sb})}_{ij}} = \\beta^{(\\mathrm{sb})}_0 + \\beta^{(\\mathrm{sb})}_1 q_j + \\beta^{(\\mathrm{sb})}_2 w_j, \\qquad \\text{DB: } \\log \\frac{\\pi^{(\\mathrm{db})}_{ij}}{1 - \\pi^{(\\mathrm{db})}_{ij}} = \\beta^{(\\mathrm{db})}_0 + \\beta^{(\\mathrm{db})}_1 q_j, \\tag{3}$$\nwhere $q_j$, $j \\in [n]$, and coefficients are bounded in absolute value by constant $B$. The goal under this\nmodel is to test whether $\\beta^{(\\mathrm{sb})}_2 = 0$. The model is called \u201cgeneralized\u201d, because it does not assume\nthat $q_j$ has a known meaning or that it can be measured. For instance, it may be that $q_j = q^*_j$, or that\n$q_j = (q^*_j)^3$, or $q_j$ may be a complex function of the content of the paper. The generalized logistic\nmodel (3) falls in the framework of Problem 1 if $\\beta^{(\\mathrm{db})}_0 = \\beta^{(\\mathrm{sb})}_0$ and $\\beta^{(\\mathrm{db})}_1 = \\beta^{(\\mathrm{sb})}_1$. Having defined\nnecessary terminology, we are now ready to formulate the main result of this section.\nTheorem 2. For any significance level $\\alpha \\in (0, 1)$, suppose that the generalized logistic model (3)\nis correct. If $\\beta^{(\\mathrm{sb})}_1 = \\beta^{(\\mathrm{db})}_1$, then the DISAGREEMENT test is guaranteed to keep the Type-I error\nbelow $\\alpha$, and also satisfy the requirement of non-trivial power irrespective of whether $\\beta^{(\\mathrm{db})}_0 = \\beta^{(\\mathrm{sb})}_0$ or\n$\\beta^{(\\mathrm{db})}_0 \\ne \\beta^{(\\mathrm{sb})}_0$. Conversely, if we allow both $\\beta^{(\\mathrm{sb})}_0 \\ne \\beta^{(\\mathrm{db})}_0$ and $\\beta^{(\\mathrm{sb})}_1 \\ne \\beta^{(\\mathrm{db})}_1$, then no test that operates\non decisions of SB and DB reviewers can control the Type-I error and simultaneously satisfy the\nrequirement of non-trivial power.\n\nTheorem 2 shows that the generalized bias testing problem is much harder than the original Problem 1\nas there exists no algorithm that solves this problem in full generality even under the specific model (3).\n6 Discussion\nIn this work we consider the problem of testing for biases in peer review. We show that under various\nconditions the approach used by prior work does not control the Type-I error rate. We underscore\nthat we do not aim at disproving the presence of biases found in the past work, but our focus is on\nvalidity of testing methods. With this goal in mind, we propose a principled approach towards testing\nfor biases in peer review and design a test that provably controls the Type-I error rate and also\nsatisfies the requirement of non-trivial power under minimal assumptions. As we showed in Section 5.2\n(more detailed discussion can be found in the extended version of this paper [29]), in general the\nassumptions we make cannot be relaxed without sacrificing the non-trivial power requirement or\ncontrol over the Type-I error. On a separate note, we also demonstrate that the experimental setup\nof Tomkins et al. [33], which uses standard procedures for (non-random) assignment of papers to\nreviewers, can itself break Type-I error guarantees of statistical procedures. 
In the extended version of\nthis paper [29] we design a novel experimental procedure which (i) is amenable to standard conference\npeer review procedures, and (ii) does not violate Type-I error guarantees of our DISAGREEMENT test.\n\nAcknowledgments\nThis work was supported in part by NSF grant CRII: CIF: 1755656 and in part by NSF grant CIF:\n1763734.\n\nReferences\n[1] Emily A. Largent and Richard T. Snodgrass. Blind Peer Review by Academic Journals, pages 75\u201395. 12 2016.\n\n[2] David Arnold, Will Dobbie, and Crystal S Yang. Racial Bias in Bail Decisions. The Quarterly Journal of Economics, 133(4):1885\u20131932, 05 2018.\n\n[3] Von Bakanic, Clark McPhail, and Rita J Simon. The manuscript review and decision-making process. American Sociological Review, pages 631\u2013642, 1987.\n\n[4] Marianne Bertrand and Sendhil Mullainathan. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4):991\u20131013, 2004.\n\n[5] Marianne Bertrand and Sendhil Mullainathan. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4):991\u20131013, September 2004.\n\n[6] Rebecca M Blank. The Effects of Double-Blind versus Single-Blind Reviewing: Experimental Evidence from The American Economic Review. American Economic Review, 81(5):1041\u20131067, December 1991.\n\n[7] Jerry Brunner and Peter C. Austin. Inflation of Type I error rate in multiple regression when independent variables are measured with error. Canadian Journal of Statistics, 37(1):33\u201346, 2009.\n\n[8] Amber E. Budden, Tom Tregenza, Lonnie W. Aarssen, Julia Koricheva, Roosa Leimu, and Christopher J. Lortie. Double-blind review favours increased representation of female authors. Trends in Ecology and Evolution, 23(1):4\u20136, 2008.\n\n[9] L. Charlin and R. S. Zemel. 
The Toronto Paper Matching System: An automated paper-reviewer assignment system. In ICML Workshop on Peer Reviewing and Publishing Models, 2013.\n\n[10] Jacob Cohen. A power primer. Psychological Bulletin, 112(1):155\u2013159, 1992.\n\n[11] Edzard Ernst and Karl-Ludwig Resch. Reviewer bias: a blinded experimental study. The Journal of laboratory and clinical medicine, 124(2):178\u2013182, 1994.\n\n[12] Ronald A. Fisher. The design of experiments. Oliver & Boyd, Oxford, England, 1935.\n\n[13] Yang Gao, Steffen Eger, Ilia Kuznetsov, Iryna Gurevych, and Yusuke Miyao. Does my rebuttal matter? Insights from a major NLP conference. CoRR, abs/1903.11367, 2019.\n\n[14] Shawndra Hill and Foster J. Provost. The myth of the double-blind review? Author identification using only citations. SIGKDD Explorations, 5:179\u2013184, 01 2003.\n\n[15] Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard H. Hovy, and Roy Schwartz. A dataset of peer reviews (PeerRead): Collection, insights and NLP applications. CoRR, abs/1804.09635, 2018.\n\n[16] Steven Kerr, James Tolliver, and Doretta Petree. Manuscript characteristics which influence acceptance for management and social science journals. Academy of Management Journal, 20(1):132\u2013141, 1977.\n\n[17] Ari Kobren, Barna Saha, and Andrew McCallum. Paper matching with local fairness constraints. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD \u201919, pages 1247\u20131257, New York, NY, USA, 2019. ACM.\n\n[18] Nan M. Laird and James H. Ware. Random-effects models for longitudinal data. Biometrics, 38(4):963\u2013974, 1982.\n\n[19] Mich\u00e8le Lamont. How professors think. Harvard University Press, 2009.\n\n[20] Mary J. Lindstrom and Douglas M. Bates. Nonlinear mixed effects models for repeated measures data. Biometrics, 46(3):673\u2013687, 1990.\n\n[21] Michael J Mahoney. 
Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive therapy and research, 1(2):161\u2013175, 1977.\n\n[22] Corinne A. Moss-Racusin, John F. Dovidio, Victoria L. Brescoll, Mark J. Graham, and Jo Handelsman. Science faculty\u2019s subtle gender biases favor male students. Proceedings of the National Academy of Sciences, 109(41):16474\u201316479, 2012.\n\n[23] Kanu Okike, Kevin T. Hug, Mininder S. Kocher, and Seth S. Leopold. Single-blind vs Double-blind Peer Review in the Setting of Author Prestige. JAMA, 316(12):1315\u20131316, 09 2016.\n\n[24] Sophia Rabe-Hesketh, Andrew Pickles, and Anders Skrondal. Correcting for covariate measurement error in logistic regression using nonparametric maximum likelihood estimation. Statistical Modelling, 3(3):215\u2013232, 2003.\n\n[25] Marco Seeber and Alberto Bacchelli. Does single blind peer review hinder newcomers? Scientometrics, 113(1):567\u2013585, Oct 2017.\n\n[26] Richard Snodgrass. Single- versus double-blind reviewing: An analysis of the literature. SIGMOD Record, 35:8\u201321, 09 2006.\n\n[27] Flaminio Squazzoni and Claudio Gandelli. Saint Matthew strikes again: An agent-based model of peer review and the scientific community structure. Journal of Informetrics, 6(2):265\u2013275, 2012.\n\n[28] Leonard A. Stefanski and Raymond J. Carroll. Covariate measurement error in logistic regression. Ann. Statist., 13(4):1335\u20131351, 12 1985.\n\n[29] I. Stelmakh, N. Shah, and A. Singh. On testing for biases in peer review. Preprint, 2019. https://www.cs.cmu.edu/~istelmak/papers/bias.pdf.\n\n[30] Ivan Stelmakh, Nihar B. Shah, and Aarti Singh. PeerReview4All: Fair and accurate reviewer assignment in peer review. arXiv preprint arXiv:1806.06237, 2018.\n\n[31] Warren Thorngate and Wahida Chowdhury. By the numbers: Track record, flawed reviews, journal space, and the fate of talented authors. 
In Advances in Social Simulation, pages 177\u2013188. Springer, 2014.\n\n[32] Ted Thornhill. We want black students, just not you: How white admissions counselors screen black prospective students. Sociology of Race and Ethnicity, page 2332649218792579, 2018.\n\n[33] Andrew Tomkins, Min Zhang, and William D. Heavlin. Reviewer bias in single- versus double-blind peer review. Proceedings of the National Academy of Sciences, 114(48):12708\u201312713, 2017.\n\n[34] Jingyan Wang and Nihar B. Shah. Your 2 is my 1, your 3 is my 9: Handling arbitrary miscalibrations in ratings. CoRR, abs/1806.05085, 2018.\n\n[35] Thomas J. Webb, Bob O\u2019Hara, and Robert P. Freckleton. Does double-blind review benefit female authors? Trends in Ecology and Evolution, 23(7):351\u2013353, 2008.\n\n[36] Sanford Weisberg. Applied Linear Regression. Wiley, Hoboken NJ, third edition, 2005.\n", "award": [], "sourceid": 2851, "authors": [{"given_name": "Ivan", "family_name": "Stelmakh", "institution": "Carnegie Mellon University"}, {"given_name": "Nihar", "family_name": "Shah", "institution": "CMU"}, {"given_name": "Aarti", "family_name": "Singh", "institution": "CMU"}]}