{"title": "A Likelihood-Free Inference Framework for Population Genetic Data using Exchangeable Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8594, "page_last": 8605, "abstract": "An explosion of high-throughput DNA sequencing in the past decade has led to a surge of interest in population-scale inference with whole-genome data. Recent work in population genetics has centered on designing inference methods for relatively simple model classes, and few scalable general-purpose inference techniques exist for more realistic, complex models. To achieve this, two inferential challenges need to be addressed: (1) population data are exchangeable, calling for methods that efficiently exploit the symmetries of the data, and (2) computing likelihoods is intractable as it requires integrating over a set of correlated, extremely high-dimensional latent variables. These challenges are traditionally tackled by likelihood-free methods that use scientific simulators to generate datasets and reduce them to hand-designed, permutation-invariant summary statistics, often leading to inaccurate inference. In this work, we develop an exchangeable neural network that performs summary statistic-free, likelihood-free inference. Our framework can be applied in a black-box fashion  across a variety of simulation-based tasks, both within and outside biology. We demonstrate the power of our approach on the recombination hotspot testing problem, outperforming the state-of-the-art.", "full_text": "A Likelihood-Free Inference Framework for\n\nPopulation Genetic Data using Exchangeable Neural\n\nNetworks\n\nJeffrey Chan\n\nUniversity of California, Berkeley\n\nchanjed@berkeley.edu\n\nJeffrey P. Spence\n\nUniversity of California, Berkeley\nspence.jeffrey@berkeley.edu\n\nSara Mathieson\n\nSwarthmore College\n\nsmathie1@swarthmore.edu\n\nValerio Perrone\n\nUniversity of Warwick\n\nv.perrone@warwick.ac.uk\n\nPaul A. 
Jenkins\n\nUniversity of Warwick\n\np.jenkins@warwick.ac.uk\n\nYun S. Song\n\nUniversity of California, Berkeley\n\nyss@berkeley.edu\n\nAbstract\n\nAn explosion of high-throughput DNA sequencing in the past decade has led to\na surge of interest in population-scale inference with whole-genome data. Re-\ncent work in population genetics has centered on designing inference methods\nfor relatively simple model classes, and few scalable general-purpose inference\ntechniques exist for more realistic, complex models. To achieve this, two inferential\nchallenges need to be addressed: (1) population data are exchangeable, calling\nfor methods that ef\ufb01ciently exploit the symmetries of the data, and (2) computing\nlikelihoods is intractable as it requires integrating over a set of correlated, extremely\nhigh-dimensional latent variables. These challenges are traditionally tackled by\nlikelihood-free methods that use scienti\ufb01c simulators to generate datasets and\nreduce them to hand-designed, permutation-invariant summary statistics, often\nleading to inaccurate inference. In this work, we develop an exchangeable neural\nnetwork that performs summary statistic-free, likelihood-free inference. Our frame-\nwork can be applied in a black-box fashion across a variety of simulation-based\ntasks, both within and outside biology. We demonstrate the power of our approach\non the recombination hotspot testing problem, outperforming the state-of-the-art.\n\n1\n\nIntroduction\n\nStatistical inference in population genetics aims to quantify the evolutionary events and parameters\nthat led to the genetic diversity we observe today. Population genetic models are typically based\non the coalescent [1], a stochastic process describing the distribution over genealogies of a random\nexchangeable set of DNA sequences from a large population. Inference in such complex models\nis challenging. 
First, standard coalescent-based likelihoods require integrating over a large set\nof correlated, high-dimensional combinatorial objects, rendering classical inference techniques\ninapplicable. Instead, likelihoods are implicitly de\ufb01ned via scienti\ufb01c simulators (i.e., generative\nmodels), which draw a sample of correlated trees and then model mutation as Poisson point processes\non the sampled trees to generate sequences at the leaves. Second, inference demands careful treatment\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fof the exchangeable structure of the data (a set of sequences), as disregarding it leads to an exponential\nincrease in the already high-dimensional state space.\nCurrent likelihood-free methods in population genetics leverage scienti\ufb01c simulators to perform\ninference, handling the exchangeable-structured data by reducing it to a suite of low-dimensional,\npermutation-invariant summary statistics [2, 3]. However, these hand-engineered statistics typically\nare not statistically suf\ufb01cient for the parameter of interest. Instead, they are often based on the\nintuition of the user, need to be modi\ufb01ed for each new task, and are not amenable to hyperparameter\noptimization strategies since the quality of the approximation is unknown.\nThe goal of this work is to develop a general-purpose inference framework for raw population genetic\ndata that is not only likelihood-free, but also summary statistic-free. We achieve this by designing a\nneural network that exploits data exchangeability to learn functions that accurately approximate the\nposterior. While deep learning offers the possibility to work directly with genomic sequence data,\npoorly calibrated posteriors have limited its adoption in scienti\ufb01c disciplines [4]. 
We overcome this\nchallenge with a training paradigm that leverages scientific simulators and repeatedly draws fresh\nsamples at each training step. We show that this yields calibrated posteriors and argue that, under a\nlikelihood-free inference setting, deep learning coupled with this \u2018simulation-on-the-fly\u2019 training has\nmany advantages over the more commonly used Approximate Bayesian Computation (ABC) [2, 5].\nTo our knowledge, this is the first method that handles the raw exchangeable data in a likelihood-free\ncontext.\nAs a concrete example, we focus on the problems of recombination hotspot testing and estimation.\nRecombination is a biological process of fundamental importance, in which the reciprocal exchange\nof DNA during cell division creates new combinations of genetic variants. Experiments have shown\nthat many species exhibit recombination hotspots, i.e., short segments of the genome with high\nrecombination rates [6]. The task of recombination hotspot testing is to predict the location of\nrecombination hotspots given genetic polymorphism data. Accurately localizing recombination\nhotspots would illuminate the biological mechanism that underlies recombination, and could help\ngeneticists map the mutations causing genetic diseases [7]. We demonstrate through experiments that\nour proposed framework outperforms the state-of-the-art on the hotspot detection problem.\nOur main contributions are:\n\n\u2022 A novel exchangeable neural network that respects permutation invariance and maps from\nthe data to the posterior distribution over the parameter of interest.\n\u2022 A simulation-on-the-fly training paradigm, which leverages scientific simulators to achieve\ncalibrated posteriors.\n\u2022 A general-purpose likelihood-free Bayesian inference method that combines the exchangeable\nneural network and simulation-on-the-fly training paradigm, and applies to both discrete and\ncontinuous settings. 
Our method can be applied to many population genetic settings by making\nstraightforward modifications to the simulator and the prior, including demographic model\nselection, archaic admixture detection, and classifying modes of natural selection.\n\u2022 An application to a single-population model for recombination hotspot testing and estimation,\noutperforming the model-based state-of-the-art, LDhot. Our approach can be seamlessly\nextended to more complex model classes, unlike LDhot and other model-based methods.\n\nOur software package defiNETti is publicly available at https://github.com/popgenmethods/defiNETti.\n\n2 Related Work\n\nLikelihood-free methods like ABC have been widely used in population genetics [2, 5, 8, 9, 10].\nIn ABC the parameter of interest is simulated from its prior distribution, and data are subsequently\nsimulated from the generative model and reduced to a pre-chosen set of summary statistics. These\nstatistics are compared to the summary statistics of the real data, and the simulated parameter is\nweighted according to the similarity of the statistics to derive an empirical estimate of the posterior\ndistribution. However, choosing summary statistics for ABC is challenging because there is a trade-off\nbetween loss of sufficiency and computational tractability. In addition, there is no direct way to\nevaluate the accuracy of the approximation.\nOther likelihood-free approaches have emerged from the machine learning community and have been\napplied to population genetics, such as support vector machines (SVMs) [11, 12], single-layer neural\nnetworks [13], and deep learning [3]. Recently, a (non-exchangeable) convolutional neural network\nmethod was proposed for raw population genetic data [14]. The connection between likelihood-free\nBayesian inference and neural networks has also been studied previously [15, 16]. 
An attractive\nproperty of these methods is that, unlike ABC, they can be applied to multiple datasets without\nrepeating the training process (i.e., amortized inference). However, current practice in population\ngenetics collapses the data to a set of summary statistics before passing it through the machine\nlearning models. Therefore, the performance still rests on the ability to laboriously hand-engineer\ninformative statistics, and must be repeated from scratch for each new problem setting.\nThe inferential accuracy and scalability of these methods can be improved by exploiting symmetries\nin the input data. Permutation-invariant models have been previously studied in machine learning for\nSVMs [17] and recently gained a surge of interest in the deep learning literature. Recent work on\ndesigning architectures for exchangeable data include [18], [19], and [20], which exploit parameter\nsharing to encode invariances.\nWe demonstrate these ideas on the discrete and continuous problems of recombination hotspot testing\nand estimation, respectively. To this end, several methods have been developed (see, e.g., [21, 22, 23]\nfor the hotspot testing problem). However, none of these are scalable to the whole genome, with the\nexception of LDhot [24, 25], so we limit our comparison to this latter method. LDhot relies on a\ncomposite likelihood, which can be seen as an approximate likelihood for summaries of the data. It\ncan be computed only for a restricted set of models (i.e., an unstructured population with piecewise\nconstant population size), is unable to capture dependencies beyond those summaries, and scales\nat least cubically with the number of DNA sequences. 
The method we propose in this paper scales\nlinearly in the number of sequences while using raw genetic data directly.\n\n3 Methods\n\n3.1 Problem Setup\nLikelihood-free methods draw parameters from the prior, \u03b8(i) \u223c \u03c0(\u03b8),\nand then use a coalescent simulator to generate data x(i) \u223c P(x | \u03b8(i)), where i is the index of each\nsimulated dataset. Each population genetic datapoint $x^{(i)} \\in \\{0, 1\\}^{n \\times d}$ typically takes the form of a\nbinary matrix, where rows correspond to individuals and columns indicate the presence of a Single\nNucleotide Polymorphism (SNP), a variable site in a DNA sequence1. Our goal is to learn the posterior\nP(\u03b8 | xobs), where \u03b8 is the parameter of interest and xobs is the observed data. For unstructured\npopulations the order of individuals carries no information, hence the rows are exchangeable. More\nconcretely, given data $X = (x^{(1)}, \\ldots, x^{(N)})$ where $x^{(i)} := (x^{(i)}_1, \\ldots, x^{(i)}_n) \\sim P(x \\mid \\theta^{(i)})$ and\n$x^{(i)}_j \\in \\{0, 1\\}^d$, we call $X$ exchangeably-structured if, for every $i$, the distribution over the rows of a\nsingle datapoint is permutation-invariant:\n\n$$P\\left(x^{(i)}_1, \\ldots, x^{(i)}_n \\mid \\theta^{(i)}\\right) = P\\left(x^{(i)}_{\\sigma(1)}, \\ldots, x^{(i)}_{\\sigma(n)} \\mid \\theta^{(i)}\\right),$$\n\nfor all permutations \u03c3 of the indices {1, . . . , n}. For inference, we propose iterating the following\nalgorithm.\n\n1. Simulation-on-the-fly: Sample a fresh minibatch of \u03b8(i) and x(i) from the prior and coalescent\nsimulator.\n\n2. Exchangeable neural network: Learn the posterior P(\u03b8(i) | x(i)) via an exchangeable\nmapping with x(i) as the input and \u03b8(i) as the label.\n\nThis framework can then be applied to learn the posterior of the evolutionary model parameters given\nxobs. 
The details on the two building blocks of our method, namely the exchangeable neural network\nand the simulation-on-the-fly paradigm, are given in Sections 3.2 and 3.3, respectively.\n\n1 Sites that have > 2 bases are rare and typically removed. Thus, a binary encoding can be used.\n\n3.2 Exchangeable Neural Network\nThe goal of the exchangeable neural network is to learn the function $f : \\{0, 1\\}^{n \\times d} \\to \\mathcal{P}_\\Theta$, where\n$\\Theta$ is the space of all parameters $\\theta$ and $\\mathcal{P}_\\Theta$ is the space of all probability distributions on $\\Theta$. We\nparameterize the exchangeable neural network by applying the same function to each row of the\nbinary matrix, then applying a symmetric function to the output of each row, finally followed by yet\nanother function mapping from the output of the symmetric function to a posterior distribution. More\nconcretely,\n\n$$f(x) := (h \\circ g)\\big(\\Phi(x_1), \\ldots, \\Phi(x_n)\\big),$$\n\nwhere $\\Phi : \\{0, 1\\}^d \\to \\mathbb{R}^{d_1}$ is a function parameterized by a convolutional neural network, $g : \\mathbb{R}^{n \\times d_1} \\to \\mathbb{R}^{d_2}$\nis a symmetric function, and $h : \\mathbb{R}^{d_2} \\to \\mathcal{P}_\\Theta$ is a function parameterized by a fully\nconnected neural network. A variant of this representation is proposed by [18] and [20]. See Figure 1\nfor an example. Throughout the paper, we choose g to be the mean of the element-wise top decile,\nsuch that d1 = d2 in order to allow for our method to be robust to changes in n at test time. Many\nother symmetric functions such as the element-wise sum, element-wise max, lexicographical sort, or\nhigher-order moments can be employed.\nThis exchangeable neural network has many advantages. While it could be argued that flexible\nmachine learning models could learn the structured exchangeability of the data, encoding exchangeability\nexplicitly allows for faster per-iteration computation and improved learning efficiency, since\ndata augmentation for exchangeability scales as O(n!). 
Enforcing exchangeability implicitly reduces\nthe size of the input space from {0, 1}n\u00d7d to the quotient space {0, 1}n\u00d7d/Sn, where Sn is the\nsymmetric group on n elements. A factorial reduction in input size leads to much more tractable\ninference for large n. In addition, choices of g where d2 is independent of n (e.g., quantile operations\nwith output dimension independent of n) allows for an inference procedure which is robust to differing\nnumber of exchangeable variables between train and test time. This property is particularly desirable\nfor performing inference with missing data.\n\n3.3 Simulation-on-the-\ufb02y\n\nSupervised learning methods traditionally use a \ufb01xed training set and make multiple passes over the\ndata until convergence. This training paradigm typically can lead to a few issues: poorly calibrated\nposteriors and over\ufb01tting. While the latter has largely been tackled by regularization methods and\nlarge datasets, the former has not been suf\ufb01ciently addressed. We say a posterior is calibrated if for\nXq,A := {x | \u02c6p(\u03b8 \u2208 A | x) = q}, we have Ex\u2208Xq,A [p(\u03b8 \u2208 A | x)] = q for all q, A. Poorly calibrated\nposteriors are particularly an issue in scienti\ufb01c disciplines as scientists often demand methods with\ncalibrated uncertainty estimates in order to measure the con\ufb01dence behind new scienti\ufb01c discoveries\n(often leading to reliance on traditional methods with asymptotic guarantees such as MCMC).\nWhen we have access to scienti\ufb01c simulators, the amount of training data available is limited only\nby the amount of compute time available for simulation, so we propose simulating each training\ndatapoint afresh such that there is exactly one epoch over the training data (i.e., no training point is\npassed through the neural network more than once). We refer to this as simulation-on-the-\ufb02y. 
Note\nthat this can be relaxed to pass each training point a small constant number of times in the case of\ncomputational constraints on the simulator. This approach guarantees properly calibrated posteriors\nand obviates the need for regularization techniques to address overfitting. Below we justify these\nproperties through the lens of statistical decision theory.\nMore formally, define the Bayes risk for prior \u03c0(\u03b8) as $R^*_\\pi = \\inf_T \\mathbb{E}_x \\mathbb{E}_{\\theta \\sim \\pi}[l(\\theta, T(x))]$, with l being\nthe loss function and T an estimator. The excess risk over the Bayes risk resulting from an algorithm\nA with model class F can be decomposed as\n\n$$R_\\pi(\\tilde{f}_A) - R^*_\\pi = \\underbrace{\\left(R_\\pi(\\tilde{f}_A) - R_\\pi(\\hat{f})\\right)}_{\\text{optimization error}} + \\underbrace{\\left(R_\\pi(\\hat{f}) - \\inf_{f \\in \\mathcal{F}} R_\\pi(f)\\right)}_{\\text{estimation error}} + \\underbrace{\\left(\\inf_{f \\in \\mathcal{F}} R_\\pi(f) - R^*_\\pi\\right)}_{\\text{approximation error}},$$\n\nwhere $\\tilde{f}_A$ and $\\hat{f}$ are the function obtained via algorithm A and the empirical risk minimizer,\nrespectively. The terms on the right hand side are referred to as the optimization, estimation, and\napproximation errors, respectively. Often the goal of statistical decision theory is to minimize the\nexcess risk, motivating algorithmic choices to control the three sources of error. For example, with\nsupervised learning, overfitting is a result of large estimation error. Typically, for a sufficiently\nexpressive neural network optimized via stochastic optimization techniques, the excess risk is dominated\nby optimization and estimation errors. 
Simulation-on-the-fly guarantees that the estimation\nerror is small, and as neural networks typically have small approximation error, we can conclude\nthat the main source of error remaining is the optimization error. It has been shown that smooth\npopulation risk surfaces can induce jagged empirical risk surfaces with many local minima [26, 27].\nWe confirmed this phenomenon empirically in the population genetic setting (Section 5), showing\nthat the risk surface is much smoother in the on-the-fly setting than the fixed training setting. This\nreduces the number of poor local minima and, consequently, the optimization error. The estimator\ncorresponding to the Bayes risk (for the cross-entropy or KL-divergence loss function) is the posterior.\nThus, the simulation-on-the-fly training paradigm guarantees generalization and calibrated posteriors\n(assuming small optimization error).\n\n4 Statistical Properties\n\nThe most widely-used likelihood-free inference method is ABC. In this section we briefly review\nABC and show that our method exhibits the same theoretical guarantees together with a set of\nadditional desirable properties.\n\nProperties of ABC. Let xobs be the observed dataset, S be the summary statistic, and d be a\ndistance metric. The algorithm for vanilla rejection ABC is as follows. Indexing each simulated\ndataset by i, for i = 1, . . . , N:\n\n1. Simulate \u03b8(i) \u223c \u03c0(\u03b8) and x(i) \u223c P(x | \u03b8(i)).\n2. Keep \u03b8(i) if d(S(x(i)), S(xobs)) \u2264 \u03b5.\n\nThe output provides an empirical estimate of the posterior. Two key results regarding ABC make it\nan attractive method for Bayesian inference: (1) Asymptotic guarantee: As \u03b5 \u2192 0, N \u2192 \u221e, and\nif S is sufficient, the estimated posterior converges to the true posterior; (2) Calibration of ABC:\nA variant of ABC (noisy ABC in [28]) which injects noise into the summary statistic function is\ncalibrated. 
For detailed proofs as well as more sophisticated variants, see [28]. Note that ABC is\nnotoriously difficult to perform diagnostics on without the ground truth posterior as many factors\ncould contribute to a poor posterior approximation: poor choice of summary statistics, incorrect\ndistance metric, insufficient number of samples, or large \u03b5.\n\nProperties of Our Method. Our method matches both theoretical guarantees of ABC, namely (1) asymptotics\nand (2) calibration, while also exhibiting additional properties: (3) amortized inference, (4)\nno dependence on user-defined summary statistics, and (5) straightforward diagnostics. While the\nindependence from summary statistics and calibration are theoretically justified in Sections 3.2 and 3.3,\nwe provide some results that justify the asymptotics, amortized inference, and diagnostics.\nIn the simulation-on-the-fly setting, convergence to a global minimum implies that a sufficiently large\nneural network architecture represents the true posterior within \u03b5-error in the following sense: for any\nfixed error \u03b5, there exist H0 and N0 such that the trained neural network produces a posterior which\nsatisfies\n\n$$\\min_w \\mathbb{E}_x\\left[\\mathrm{KL}\\left(P(\\theta \\mid x) \\,\\|\\, P^{(N)}_{DL}(\\theta \\mid x; w, H)\\right)\\right] < \\epsilon, \\tag{1}$$\n\nfor all H > H0 and N > N0, where H is the minimum number of hidden units across all neural\nnetwork layers, N is the number of training points, w the weights parameterizing the network, and\nKL the Kullback\u2013Leibler divergence between the population risk and the risk of the neural network.\nUnder these assumptions, the following proposition holds.\nProposition 1. 
For any x, \u03b5 > 0, and fixed error \u03b4 > 0, there exist H > H0 and N > N0 such\nthat\n\n$$\\mathrm{KL}\\left(P(\\theta \\mid x) \\,\\|\\, P^{(N)}_{DL}(\\theta \\mid x; w^*, H)\\right) < \\delta \\tag{2}$$\n\nwith probability at least $1 - \\epsilon/\\delta$, where $w^*$ is the minimizer of (1).\n\nWe can get stronger guarantees in the discrete setting common to population genetic data.\n\nFigure 1: A cartoon schematic of the exchangeable architecture for population genetics.\n\nCorollary 1. Under the same conditions, if x is discrete and P(x) > 0 for all x, the KL divergence\nappearing in (2) converges to 0 uniformly in x, as H, N \u2192 \u221e.\nThe proofs are given in the supplement. These results both establish the asymptotic guarantees of\nour method and show that such guarantees hold for all x (i.e., amortized inference). Diagnostics for\nthe quality of the approximation can be performed via hyperparameter optimization to compare the\nrelative loss of the neural network under a variety of optimization and architecture settings.\n\n5 Empirical Study: Recombination Hotspot Testing\n\nIn this section, we study the accuracy of our framework to test for recombination hotspots. As very\nfew hotspots have been experimentally validated, we primarily evaluate our method on simulated\ndata, with parameters set to match human data. The presence of ground truth allows us to benchmark\nour method and compare against LDhot (additional details on LDhot in the supplement). For the\nposterior in this classification task (hotspot or not), we use the softmax probabilities. Unless otherwise\nspecified, for all experiments we use the mutation rate $\\mu = 1.1 \\times 10^{-8}$ per generation per nucleotide,\nconvolution patch length of 5 SNPs, 32 and 64 convolution filters for the first two convolution layers,\n128 hidden units for both fully connected layers, and 20-SNP length windows. 
The experiments\ncomparing against LDhot used sample size n = 64 to construct lookup tables for LDhot quickly.\nAll other experiments use n = 198, matching the size of the CEU population (i.e., Utah Residents\nwith Northern and Western European ancestry) in the 1000 Genomes dataset. All simulations were\nperformed using msprime [29]. Gradient updates were performed using Adam [30] with learning\nrate $1 \\times 10^{-3} \\times 0.9^{b/10000}$, b being the batch count. In addition, we augment the binary matrix, x, to\ninclude the distance information between neighboring SNPs in an additional channel resulting in a\ntensor of size n \u00d7 d \u00d7 2.\n\n5.1 Recombination Hotspot Details\n\nRecombination hotspots are short regions of the genome with high recombination rate relative to the\nbackground. As the recombination rate between two DNA locations tunes the correlation between\ntheir corresponding genealogies, hotspots play an important role in complex disease inheritance\npatterns. In order to develop accurate methodology, a precise mathematical definition of a hotspot\nneeds to be specified in accordance with the signatures of biological interest. We use the following:\nDefinition 1 (Recombination Hotspot). Let a window over the genome be subdivided into three\nsubwindows $w = (w_l, w_h, w_r)$ with physical distances (i.e., window widths) $\\alpha_l$, $\\alpha_h$, and $\\alpha_r$,\nrespectively, where $w_l, w_h, w_r \\in G$ and $G$ is the space over all possible subwindows of the genome. Let\na mean recombination map $R : G \\to \\mathbb{R}_+$ be a function that maps from a subwindow of the genome\nto the mean recombination rate per base pair in the subwindow. A recombination hotspot for a given\nmean recombination map R is a window w which satisfies the following properties:\n\n1. Elevated local recombination rate: $R(w_h) > k \\cdot \\max\\big(R(w_l), R(w_r)\\big)$\n\n2. 
Large absolute recombination rate: $R(w_h) > k \\tilde{r}$\n\nwhere $\\tilde{r}$ is the median (at a per base pair level) genome-wide recombination rate, and k > 1 is the\nrelative hotspot intensity.\n\nThe first property is necessary to enforce the locality of hotspots and rule out large regions of high\nrecombination rate, which are typically not considered hotspots by biologists. The second property\nrules out regions of minuscule background recombination rate in which sharp relative spikes in\nrecombination still remain too small to be biologically interesting. The median is chosen here to be\nrobust to the right skew of the distribution of recombination rates. Typically, for the human genome\nwe use $\\alpha_l = \\alpha_r = 13$ kb, $\\alpha_h = 2$ kb, and k = 10 based on experimental findings.\n\nFigure 2: (Left) Accuracy comparison between exchangeable vs nonexchangeable architectures.\n(Right) Performance of changing the number of individuals at test time for varying training sample\nsizes.\n\n5.2 Evaluation of Exchangeable Neural Network\n\nWe compare the behavior of an explicitly exchangeable architecture to a nonexchangeable architecture\nthat takes 2D convolutions with varying patch heights. The accuracy under human-like\npopulation genetic parameters with varying 2D patch heights is shown in the left panel of Figure 2.\nSince each training point is simulated on-the-fly, data augmentation is performed implicitly in the\nnonexchangeable version without having to explicitly permute the rows of each training point. As\nexpected, directly encoding the permutation invariance leads to more efficient training and higher\naccuracy while also benefiting from a faster per-batch computation time. Furthermore, the slight\naccuracy decrease when increasing the patch height confirms the difficulty of learning permutation\ninvariance as n grows. Another advantage of exchangeable architectures is the robustness to the\nnumber of individuals at test time. 
As shown in right panel of Figure 2, the accuracy remains above\n90% during test time for sample sizes roughly 0.1\u201320\u00d7 the train sample size.\n\n5.3 Evaluation of Simulation-on-the-\ufb02y\n\nNext, we analyze the effect of simulation-on-the-\ufb02y in comparison to the standard \ufb01xed training set. A\n\ufb01xed training set size of 10000 was used and run for 20000 training batches and a test set of size 5000.\nFor a network using simulation-on-the-\ufb02y, 20000 training batches were run and evaluated on the\nsame test set. In other words, we ran both the simulation on-the-\ufb02y and \ufb01xed training set for the same\nnumber of iterations with a batch size of 50, but the simulation-on-the-\ufb02y draws a fresh datapoint\nfrom the generative model upon each update so that no datapoint is used more than once. The weights\nwere initialized with a \ufb01xed random seed in both settings with 20 replicates. Figure 3 (left) shows\nthat the \ufb01xed training set setting has both a higher bias and higher variance than simulation-on-the-\ufb02y.\nThe bias can be attributed to the estimation error of a \ufb01xed training set in which the empirical risk\nsurface is not a good approximation of the population risk surface. The variance can be attributed to\nan increase in the number of poor quality local optima in the \ufb01xed training set case.\nWe next investigated posterior calibration. This gives us a measure for whether there is any bias in the\nuncertainty estimates output by the neural network. We evaluated the calibration of simulation-on-\nthe-\ufb02y against using a \ufb01xed training set of 10000 datapoints. The calibration curves were generated\nby evaluating 25000 datapoints at test time and binning their posteriors, computing the fraction of\ntrue labels for each bin. 
A perfectly calibrated curve is the dashed black line shown in Figure 3 (right).\nIn accordance with the theory in Section 3.3, simulation-on-the-fly is much better calibrated, and\nincreasing the number of training examples leads to a better-calibrated function. On the other\nhand, the fixed training procedure is poorly calibrated.\n\nFigure 3: (Left) Comparison between the test cross entropy of a fixed training set of size 10000 and\nsimulation-on-the-fly. (Right) Posterior calibration. The black dashed line is a perfectly calibrated\ncurve. The red and purple lines are for simulation-on-the-fly after 20k and 60k iterations; the blue\nand green lines for a fixed training set of 10k points, for 20k and 60k iterations.\n\n5.4 Comparison to LDhot\n\nWe compared our method against LDhot in two settings: (i) sampling empirical recombination rates\nfrom the HapMap recombination map for CEU and YRI (i.e., Yoruba in Ibadan, Nigeria) [31] to set the\nbackground recombination rate, and then using this background to simulate a flat recombination map\nwith 10\u2013100\u00d7 relative hotspot intensity, and (ii) sampling segments of the HapMap recombination\nmap for CEU and YRI and classifying them as hotspots according to our definition, then simulating\nfrom the drawn variable map.\n\nFigure 4: (Left) ROC curve in the CEU and YRI setting for the deep learning and LDhot method.\nThe black line represents a random classifier. (Middle) Windows of the HapMap recombination\nmap drawn based on whether they matched up with our hotspot definition. The blue and green lines\ncoincide almost exactly. (Right) The inferred posteriors for the continuous case. The circles represent\nthe mean of the posterior and the bars represent the 95% credible interval. The green line shows\nwhen the true heat is equal to the inferred heat.\n\nThe ROC curves for both settings are shown in Figure 4. Under the bivariate empirical background\nprior regime where there is a flat background rate and flat hotspot, both methods performed quite\nwell as shown on the left panel of Figure 4. We note that the slight performance decrease for YRI\nwhen using LDhot is likely due to hyperparameters that require tuning for each population size.\nThis bivariate setting is precisely the one assumed by the likelihood ratio test that LDhot performs. However, as flat\nbackground rates and hotspots are not realistic, we sample windows from the HapMap recombination\nmap and label them according to a more suitable hotspot definition that ensures locality and rules\nout negligible recombination spikes. The middle panel of Figure 4 uses the same hotspot definition\nin the training and test regimes, and is strongly favorable towards the deep learning method. Under\na sensible definition of recombination hotspots and realistic recombination maps, our method still\nperforms well while LDhot performs almost randomly. We believe that the true performance of\nLDhot is somewhere between the first and second settings, with performance dominated by the deep\nlearning method. Importantly, this improvement is achieved without access to any problem-specific\nsummary statistics.\nOur approach reached 90% accuracy in fewer than 2000 iterations, taking approximately 0.5 hours\non a 64 core machine with the computational bottleneck due to the msprime simulation [29]. 
For LDhot, the two-locus lookup table for variable population size using the LDpop fast approximation [32] took 9.5 hours on a 64 core machine (downsampling n = 198 from N = 256). The lookup table has a computational complexity of O(n^3) while per-iteration training of the neural network scales as O(n), allowing for much larger sample sizes. In addition, our method scales well to large local regions, being able to easily handle 800-SNP windows.\n\n[Plot labels: Test Cross Entropy, Fixed Training Set, Simulation-on-the-Fly; True Heat versus Posterior Mean and 95% Credible Interval]\n\n5.5 Recombination Hotspot Intensity Estimation: The Continuous Case\n\nTo demonstrate the flexibility of our method in the continuous parameter regime, we adapted our method to the problem of estimating the intensity (or heat) of a hotspot. The problem setup fixes the background recombination rate R(w_l) = R(w_r) = 0.0005 and seeks to estimate the relative hotspot recombination intensity k. The demography is set to that of CEU. The hotspot intensity k was simulated from a uniform prior between 1 and 100.\nFor continuous parameters, arbitrary posteriors cannot be simply parameterized by a vector with dimension equal to the number of classes, as was done in the discrete parameter setting. Instead, an approximate posterior distribution from a convenient parametric family is used to get uncertainty estimates of our parameter of interest. This is achieved by leveraging our exchangeable network to output parameter estimates for the posterior distribution as done in [33]. 
For example, if we use a normal distribution as our approximate posterior, the network outputs estimates of the mean and precision. The corresponding loss function is the negative log-likelihood\n\n\u2212 log p(k | x) = \u2212 (log \u03c4(x)) / 2 + \u03c4(x) (k \u2212 \u00b5(x))^2 / 2 + const,    (3)\n\nwhere \u00b5 and \u03c4 are the mean and the precision of the posterior, respectively. More flexible distribution families such as a Gaussian mixture model can be used for a better approximation to the true posterior.\nWe evaluate our method in terms of calibration and quality of the point estimates to check that our method yields valid uncertainty estimates. The right panel of Figure 4 shows the means and 95% credible intervals inferred by our method using a log-normal as the approximate posterior distribution. As a measure of the calibration of the posteriors, the true intensity fell inside the 95% credible interval 97% of the time over a grid of 500 equally spaced points between k = 1 and 100. We measure the quality of the point estimates with the Spearman correlation between the true heats at the 500 equally spaced points and the estimated posterior means, which was 0.697. Using a Gaussian mixture model with 10 components improved this to 0.782. This illustrates that our method can be easily adapted to estimate the posterior distribution in the continuous regime.\n\n6 Discussion\n\nWe have proposed the first likelihood-free inference method for exchangeable population genetic data that does not rely on handcrafted summary statistics. To achieve this, we designed a family of neural networks that learn an exchangeable representation of population genetic data, which is in turn mapped to the posterior distribution over the parameter of interest. Our simulation-on-the-fly training paradigm produced calibrated posterior estimates. 
State-of-the-art accuracy was demonstrated on the challenging problem of recombination hotspot testing.\nThe development and application of exchangeable neural networks to fully harness raw sequence data addresses an important challenge in applying machine learning to population genomics. The standard practice of reducing data to ad hoc summary statistics, which are then plugged into standard machine learning pipelines, is well recognized as a major shortcoming. Within the population genetics community, our method proves to be a major advance in likelihood-free inference in situations where ABC is too inaccurate. Several works have applied ABC to different contexts, and each one requires devising a new set of summary statistics. Our method can be extended in a black-box manner to these situations, which include inference on point clouds and quantifying evolutionary events.\n\nAcknowledgements\n\nWe thank Ben Graham for helpful discussions and Yuval Simons for his suggestion to use the decile.\nThis research is supported in part by an NSF Graduate Research Fellowship (JC); EPSRC grants EP/L016710/1 (VP) and EP/L018497/1 (PJ); an NIH grant R01-GM094402 (JC, JPS, SM, and YSS); and a Packard Fellowship for Science and Engineering (YSS). YSS is a Chan Zuckerberg Biohub investigator. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. This research also used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.\n\nReferences\n\n[1] J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13(3):235\u2013248, 1982.\n\n[2] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. 
Genetics, 162(4):2025\u20132035, 2002.\n\n[3] S. Sheehan and Y. S. Song. Deep learning for population genetic inference. PLoS Computational Biology, 12(3):e1004845, 2016.\n\n[4] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. arXiv:1706.04599, 2017.\n\n[5] J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol, 16(12):1791\u20138, 1999.\n\n[6] T. D. Petes. Meiotic recombination hot spots and cold spots. Nature Reviews Genetics, 2(5):360\u2013369, 2001.\n\n[7] J. Hey. What\u2019s so hot about recombination hotspots? PLoS Biol, 2(6):e190, 2004.\n\n[8] S. Boitard, W. Rodr\u00edguez, F. Jay, S. Mona, and F. Austerlitz. Inferring population size history from large samples of genome-wide molecular data - an approximate Bayesian computation approach. PLoS Genetics, 12(3):e1005877, 2016.\n\n[9] D. Wegmann, C. Leuenberger, and L. Excoffier. Efficient Approximate Bayesian Computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 182(4):1207\u20131218, 2009.\n\n[10] V. C. Sousa, M. Fritz, M. A. Beaumont, and L. Chikhi. Approximate Bayesian computation without summary statistics: The case of admixture. Genetics, 181(4):1507\u20131519, 2009.\n\n[11] D. R. Schrider and A. D. Kern. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biology and Evolution, 7(12):3511\u20133528, 2015.\n\n[12] P. Pavlidis, J. D. Jensen, and W. Stephan. Searching for footprints of positive selection in whole-genome SNP data from nonequilibrium populations. Genetics, 185(3):907\u2013922, 2010.\n\n[13] M. G. B. Blum and O. Fran\u00e7ois. Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing, 20(1):63\u201373, 2010.\n\n[14] L. Flagel, Y. J. Brandvain, and D. R. Schrider. 
The unreasonable effectiveness of convolutional neural networks in population genetic inference. bioRxiv, page 336073, 2018.\n\n[15] B. Jiang, T.-y. Wu, C. Zheng, and W. H. Wong. Learning summary statistic for approximate Bayesian computation via deep neural network. arXiv:1510.02175, 2015.\n\n[16] G. Papamakarios and I. Murray. Fast \u03b5-free inference of simulation models with Bayesian conditional density estimation. arXiv:1605.06376, 2016.\n\n[17] P. K. Shivaswamy and T. Jebara. Permutation invariant SVMs. In International Conference on Machine Learning, pages 817\u2013824, 2006.\n\n[18] S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep learning with sets and point clouds. arXiv:1611.04500, 2016.\n\n[19] N. Guttenberg, N. Virgo, O. Witkowski, H. Aoki, and R. Kanai. Permutation-equivariant neural networks applied to dynamics prediction. arXiv:1612.04530, 2016.\n\n[20] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola. Deep sets. Neural Information Processing Systems, 2017.\n\n[21] P. Fearnhead. SequenceLDhot: detecting recombination hotspots. Bioinformatics, 22:3061\u20133066, 2006.\n\n[22] J. Li, M. Q. Zhang, and X. Zhang. A new method for detecting human recombination hotspots and its applications to the HapMap ENCODE data. The American Journal of Human Genetics, 79(4):628\u2013639, 2006.\n\n[23] Y. Wang and B. Rannala. Population genomic inference of recombination rates and hotspots. Proceedings of the National Academy of Sciences, 106(15):6215\u20136219, 2009.\n\n[24] A. Auton, S. Myers, and G. McVean. Identifying recombination hotspots using population genetic data. arXiv:1403.4264, 2014.\n\n[25] J. D. Wall and L. S. Stevison. Detecting recombination hotspots from patterns of linkage disequilibrium. G3: Genes, Genomes, Genetics, 2016.\n\n[26] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. 
arXiv preprint arXiv:1702.07966, 2017.\n\n[27] C. Jin, L. T. Liu, R. Ge, and M. I. Jordan. Minimizing nonconvex population risk from rough empirical risk. arXiv preprint arXiv:1803.09357, 2018.\n\n[28] P. Fearnhead and D. Prangle. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419\u2013474, 2012.\n\n[29] J. Kelleher, A. M. Etheridge, and G. McVean. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Computational Biology, 12(5):e1004842, 2016.\n\n[30] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.\n\n[31] R. A. Gibbs, J. W. Belmont, P. Hardenbol, T. D. Willis, F. Yu, H. Yang, L.-Y. Ch\u2019ang, W. Huang, B. Liu, Y. Shen, et al. The international HapMap project. Nature, 426(6968):789\u2013796, 2003.\n\n[32] J. A. Kamm, J. P. Spence, J. Chan, and Y. S. Song. Two-locus likelihoods under variable population size and fine-scale recombination rate estimation. Genetics, 203(3):1381\u20131399, 2016.\n\n[33] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402\u20136413, 2017.", "award": [], "sourceid": 5189, "authors": [{"given_name": "Jeffrey", "family_name": "Chan", "institution": "UC Berkeley"}, {"given_name": "Valerio", "family_name": "Perrone", "institution": "University of Warwick"}, {"given_name": "Jeffrey", "family_name": "Spence", "institution": "UC Berkeley"}, {"given_name": "Paul", "family_name": "Jenkins", "institution": "University of Warwick"}, {"given_name": "Sara", "family_name": "Mathieson", "institution": "Swarthmore College"}, {"given_name": "Yun", "family_name": "Song", "institution": "UC Berkeley"}]}