{"title": "Information Constraints on Auto-Encoding Variational Bayes", "book": "Advances in Neural Information Processing Systems", "page_first": 6114, "page_last": 6125, "abstract": "Parameterizing the approximate posterior of a generative model with neural networks has become a common theme in recent machine learning research. While providing appealing flexibility, this approach makes it difficult to impose or assess structural constraints such as conditional independence. We propose a framework for learning representations that relies on Auto-Encoding Variational Bayes and whose search space is constrained via kernel-based measures of independence. In particular, our method employs the $d$-variable Hilbert-Schmidt Independence Criterion (dHSIC) to enforce independence between the latent representations and arbitrary nuisance factors.\nWe show how to apply this method to a range of problems, including the problems of learning invariant representations and the learning of interpretable representations. We also present a full-fledged application to single-cell RNA sequencing (scRNA-seq). In this setting the biological signal in mixed in complex ways with sequencing errors and sampling effects. We show that our method out-performs the state-of-the-art in this domain.", "full_text": "Information Constraints on Auto-Encoding Variational Bayes\n\nRomain Lopez1, Jeffrey Regier1, Michael I. Jordan1,2, and Nir Yosef1,3,4\n\n{romain_lopez, regier, niryosef}@berkeley.edu\n\njordan@cs.berkeley.edu\n\n1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley\n\n2Department of Statistics, University of California, Berkeley\n\n3Ragon Institute of MGH, MIT and Harvard\n\n4Chan-Zuckerberg Biohub\n\nAbstract\n\nParameterizing the approximate posterior of a generative model with neural net-\nworks has become a common theme in recent machine learning research. 
While providing appealing flexibility, this approach makes it difficult to impose or assess structural constraints such as conditional independence. We propose a framework for learning representations that relies on auto-encoding variational Bayes, in which the search space is constrained via kernel-based measures of independence. In particular, our method employs the $d$-variable Hilbert-Schmidt Independence Criterion (dHSIC) to enforce independence between the latent representations and arbitrary nuisance factors. We show how this method can be applied to a range of problems, including problems that involve learning invariant and conditionally independent representations. We also present a full-fledged application to single-cell RNA sequencing (scRNA-seq). In this setting the biological signal is mixed in complex ways with sequencing errors and sampling effects. We show that our method outperforms the state-of-the-art approach in this domain.

1 Introduction

Since the introduction of variational auto-encoders (VAEs) [1], graphical models whose conditional distributions are specified by deep neural networks have become commonplace. For problems where all that matters is the goodness-of-fit (e.g., marginal log probability of the data), there is little reason to constrain the flexibility/expressiveness of these networks other than possible considerations of overfitting. In other problems, however, some latent representations may be preferable to others—for example, for reasons of interpretability or modularity. Traditionally, such constraints on latent representations have been expressed in the graphical model setting via conditional independence assumptions.
But these assumptions are relatively rigid, and with the advent of highly flexible conditional distributions, it has become important to find ways to constrain latent representations that go beyond the rigid conditional independence structures of classical graphical models.

In this paper, we propose a new method for restricting the search space to latent representations with desired independence properties. As in [1], we approximate the posterior for each observation $X$ with an encoder network that parameterizes $q_\phi(Z \mid X)$. Restricting this search space amounts to constraining the class of variational distributions that we consider. In particular, we aim to constrain the aggregated variational posterior [2]:

$\hat{q}_\phi(Z) := \mathbb{E}_{p_{\text{data}}(X)}[q_\phi(Z \mid X)]$.   (1)

Here $p_{\text{data}}(X)$ denotes the empirical distribution. We aim to enforce independence statements of the form $\hat{q}_\phi(z_i) \perp \hat{q}_\phi(z_j)$, where $i$ and $j$ are different coordinates of our latent representation.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

[Figure 1: Tasks presented in the paper. (a) Learning interpretable representations; (b) learning invariant representations; (c) learning denoised representations.]

Unfortunately, because $\hat{q}_\phi(Z)$ is a mixture distribution, computing any standard measure of independence is intractable, even in the case of Gaussian terms [3]. In this paper, we circumvent this problem in a novel way. First, we estimate dependency through a kernel-based measure of independence, in particular the Hilbert-Schmidt Independence Criterion (HSIC) [4]. Second, by scaling and then subtracting this measure of dependence in the variational lower bound, we get a new variational lower bound on $\log p(X)$. Maximizing it amounts to maximizing the traditional variational lower bound with a penalty for deviating from the desired independence conditions. We refer to this approach as HSIC-constrained VAE (HCV).

The remainder of the paper is organized as follows. In Section 2, we provide background on VAEs and the HSIC. In Section 3, we precisely define HCV and provide a theoretical analysis. The next three sections each present an application of HCV—one for each task shown in Figure 1. In Section 4, we consider the problem of learning an interpretable latent representation, and we show that HCV compares favorably to β-VAE [5] and β-TCVAE [6]. In Section 5, we consider the problem of learning an invariant representation, showing both that HCV includes the variational fair auto-encoder (VFAE) [7] as a special case, and that it can improve on the VFAE with respect to its own metrics. In Section 6, we denoise single-cell RNA sequencing data with HCV, and show that our method recovers biological signal better than the current state-of-the-art approach.

2 Background

In representation learning, we aim to transform a variable $x$ into a representation vector $z$ for which a given downstream task can be performed more efficiently, either computationally or statistically. For example, one may learn a low-dimensional representation that is predictive of a particular label $y$, as in supervised dictionary learning [8]. More generally, a hierarchical Bayesian model [9] applied to a dataset yields stochastic representations, namely, the sufficient statistics for the model's posterior distribution. In order to learn representations that respect specific independence statements, we need to bring together two independent lines of research.
First, we briefly present variational auto-encoders, and then non-parametric measures of dependence.

2.1 Auto-Encoding Variational Bayes (AEVB)

We focus on variational auto-encoders [1], which effectively summarize data for many tasks within a Bayesian inference paradigm [10, 11]. Let $\{X, S\}$ denote the set of observed random variables and $Z$ the set of hidden random variables (we will use the notation $z_i$ to denote the $i$-th random variable in the set $Z$). Then Bayesian inference aims to maximize the likelihood:

$p_\theta(X \mid S) = \int p_\theta(X \mid Z, S)\, dp(Z)$.   (2)

Because the integral is in general intractable, variational inference finds a distribution $q_\phi(Z \mid X, S)$ that maximizes a lower bound on the data likelihood—the evidence lower bound (ELBO):

$\log p_\theta(X \mid S) \geq \mathbb{E}_{q_\phi(Z \mid X, S)} \log p_\theta(X \mid Z, S) - D_{KL}(q_\phi(Z \mid X, S) \,\|\, p(Z))$.   (3)

In auto-encoding variational Bayes (AEVB), the variational distribution is parametrized by a neural network. In the case of a variational auto-encoder (VAE), both the generative model and the variational approximation have conditional distributions parametrized with neural networks. The difference between the data likelihood and the ELBO is the variational gap:

$D_{KL}(q_\phi(Z \mid X, S) \,\|\, p_\theta(Z \mid X, S))$.   (4)

The original AEVB framework is described in the seminal paper [1] for the case $Z = \{z\}$, $X = \{x\}$, $S = \emptyset$. The representation $z$ is optimized to "explain" the data $x$. AEVB has since been successfully applied and extended. One notable example is the semi-supervised learning case—where $Z = \{z_1, z_2\}$, $X = \{x\}$, $y \in X \cup Z$—which is addressed by the M1 + M2 graphical model [12]. Here, the representation $z_1$ both explains the original data and is predictive of the label $y$. More generally, solving an additional problem is tantamount to adding a node in the underlying graphical model. Finally, the variational distribution can be used to meet different needs: $q_\phi(y \mid x)$ is a classifier and $q_\phi(z_1 \mid x)$ summarizes the data.

When using AEVB, the empirical data distribution $p_{\text{data}}(X, S)$ is transformed into the empirical representation $\hat{q}_\phi(Z) = \mathbb{E}_{p_{\text{data}}(X, S)} q_\phi(Z \mid X, S)$. This mixture is commonly called the aggregated posterior [13] or average encoding distribution [14].

2.2 Non-parametric estimates of dependence with kernels

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\mathcal{X}$ (resp. $\mathcal{Y}$) be a separable metric space. Let $u: \Omega \to \mathcal{X}$ (resp. $v: \Omega \to \mathcal{Y}$) be a random variable. Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (resp. $l: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$) be a continuous, bounded, positive semi-definite kernel. Let $\mathcal{H}$ (resp. $\mathcal{K}$) be the corresponding reproducing kernel Hilbert space (RKHS) and $\varphi: \Omega \to \mathcal{H}$ (resp. $\psi: \Omega \to \mathcal{K}$) the corresponding feature mapping.

Given this setting, one can embed the distribution $P$ of the random variable $u$ into a single point $\mu_P$ of the RKHS $\mathcal{H}$ as follows:

$\mu_P = \int_\Omega \varphi(u) P(du)$.   (5)

If the kernel $k$ is universal¹, then the mean embedding operator $P \mapsto \mu_P$ is injective [15].

We now introduce a kernel-based estimate of distance between two distributions $P$ and $Q$ over the random variable $u$. This approach will be used by one of our baselines for learning invariant representations. Such a distance, defined via the canonical distance between their $\mathcal{H}$-embeddings, is called the maximum mean discrepancy [16] and denoted MMD$(P, Q)$.

The joint distribution $P(u, v)$ defined over the product space $\mathcal{X} \times \mathcal{Y}$ can be embedded as a point $C_{uv}$ in the tensor space $\mathcal{H} \otimes \mathcal{K}$. It can also be interpreted as a linear map $\mathcal{H} \to \mathcal{K}$:

$\forall (f, g) \in \mathcal{H} \times \mathcal{K}, \quad \mathbb{E} f(u) g(v) = \langle f(u), C_{uv} g(v) \rangle_{\mathcal{H}} = \langle f \otimes g, C_{uv} \rangle_{\mathcal{H} \otimes \mathcal{K}}$.   (6)

Suppose the kernels $k$ and $l$ are universal. The largest eigenvalue of the linear operator $C_{uv}$ is zero if and only if the random variables $u$ and $v$ are marginally independent [4]. A measure of dependence can therefore be derived from the Hilbert-Schmidt norm of the cross-covariance operator $C_{uv}$, called the Hilbert-Schmidt Independence Criterion (HSIC) [17]. Let $(u_i, v_i)_{1 \leq i \leq n}$ denote a sequence of iid copies of the random variable $(u, v)$. In the case where $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{Y} = \mathbb{R}^q$, the V-statistic in Equation 7 yields a biased empirical estimate [15], which can be computed in $O(n^2(p+q))$ time. An estimator for HSIC is

$\widehat{\mathrm{HSIC}}_n(P) = \frac{1}{n^2} \sum_{i,j} k(u_i, u_j) l(v_i, v_j) + \frac{1}{n^4} \sum_{i,j,k,l} k(u_i, u_j) l(v_k, v_l) - \frac{2}{n^3} \sum_{i,j,k} k(u_i, u_j) l(v_i, v_k)$.   (7)

The dHSIC [18, 19] generalizes the HSIC to $d$ variables. We present the dHSIC in Appendix A.

¹A kernel $k$ is universal if $k(x, \cdot)$ is continuous for all $x$ and the RKHS induced by $k$ is dense in $C(\mathcal{X})$. This is true for the Gaussian kernel $(u, u') \mapsto e^{-\gamma \|u - u'\|^2}$ when $\gamma > 0$.

3 Theory for HSIC-Constrained VAE (HCV)

This paper is concerned with interpretability of representations learned via VAEs. Independence between certain components of the representation can aid in interpretability [6, 20]. First, we will explain why AEVB might not be suitable for learning representations that satisfy independence statements. Second, we will present a simple diagnostic in the case where the generative model is fixed.
Third, we will introduce HSIC-constrained VAEs (HCV): our method to correct approximate posteriors learned via AEVB in order to recover independent representations.

3.1 Independence and representations: Ideal setting

The goal of learning a representation that satisfies certain independence statements can be achieved by adding suitable nodes and edges to the generative distribution's graphical model. In particular, marginal independence can be the consequence of an "explaining away" pattern as in Figure 1a for the triplet $\{u, x, v\}$. If we consider the setting of infinite data and an accurate posterior, we find that independence statements in the generative model are respected in the latent representation:

Proposition 1. Let us apply AEVB to a model $p_\theta(X, Z \mid S)$ with independence statement $\mathcal{I}$ (e.g., $z_i \perp z_j$ for some $(i, j)$). If the variational gap $\mathbb{E}_{p_{\text{data}}(X, S)} D_{KL}(q_\phi(Z \mid X, S) \,\|\, p_\theta(Z \mid X, S))$ is zero, then under infinite data the representation $\hat{q}_\phi(Z)$ satisfies statement $\mathcal{I}$.

The proof appears in Appendix B. In practice we may be far from the idealized infinite setting if $(X, S)$ are high-dimensional. Also, AEVB is commonly used with a naive mean field approximation $q_\phi(Z \mid X, S) = \prod_k q_\phi(z_k \mid X, S)$, which could poorly match the real posterior. In the case of a VAE, neural networks are also used to parametrize the conditional distributions of the generative model. This makes it challenging to know whether naive mean field or any specific improvement [11, 21] is appropriate. As a consequence, the aggregated posterior could be quite different from the "exact" aggregated posterior $\mathbb{E}_{p_{\text{data}}(X, S)} p_\theta(Z \mid X, S)$. Notably, the independence properties encoded by the generative model $p_\theta(X \mid S)$ will often not be respected by the approximate posterior. This is observed empirically in [7], as well as in Section 4 and Section 5 of this work.

3.2 A simple diagnostic in the case of posterior approximation

A theoretical analysis explaining why the empirical aggregated posterior presents some misspecified correlation is not straightforward. The main reason is that learning the model parameters $\theta$ along with the variational parameters $\phi$ makes diagnosis hard. As a first line of attack, let us consider the case where we approximate the posterior of a fixed model. Consider learning a posterior $q_\phi(Z \mid X, S)$ via naive mean field AEVB. Recent work [22, 14, 13] focuses on decomposing the second term of the ELBO and identifying terms, one of which is the total correlation between hidden variables in the aggregated posterior. This term, in principle, promotes independence. However, the decomposition has numerous interacting terms, which makes exact interpretation difficult. As the generative model is fixed in this setting, optimizing the ELBO is tantamount to minimizing the variational gap, which we propose to decompose as

$D_{KL}(q_\phi(Z \mid X, S) \,\|\, p_\theta(Z \mid X, S)) = \sum_k D_{KL}(q_\phi(z_k \mid X, S) \,\|\, p_\theta(z_k \mid X, S)) + \mathbb{E}_{q_\phi(Z \mid X, S)} \log \frac{\prod_k p_\theta(z_k \mid X, S)}{p_\theta(Z \mid X, S)}$.   (8)

The last term of this equation quantifies the misspecification of the mean-field assumption. The larger it is, the stronger the coupling between the hidden variables $Z$. Since neural networks are flexible, they can be very successful at optimizing this variational gap, but at the price of introducing supplemental correlation between the components of $Z$ in the aggregated posterior.
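The decomposition in Equation 8 is an exact identity for any mean-field variational family. As a small sanity check (ours, not part of the paper), every term can be written in closed form when the fixed posterior is a correlated Gaussian and the mean-field approximation is a standard Gaussian:

```python
import numpy as np

# Toy check of Equation 8: q = N(0, I_2) (naive mean field),
# p = N(0, Sigma) with off-diagonal correlation (the "exact" posterior).
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Prec = np.linalg.inv(Sigma)
d = 2

# Left-hand side: closed-form KL between two zero-mean Gaussians.
kl_full = 0.5 * (np.trace(Prec) + np.log(np.linalg.det(Sigma)) - d)

# First term: sum of marginal KLs. The marginals of p are N(0, Sigma_kk);
# with a unit diagonal each marginal KL vanishes.
kl_marginals = sum(
    0.5 * (1.0 / Sigma[k, k] - 1.0 + np.log(Sigma[k, k])) for k in range(d)
)

# Second term: E_q[log prod_k p_k(z_k) - log p(z)], using E_q[z_k^2] = 1
# and E_q[z^T Prec z] = trace(Prec) for q = N(0, I).
e_log_prod_marginals = sum(
    -0.5 * np.log(2 * np.pi * Sigma[k, k]) - 0.5 / Sigma[k, k] for k in range(d)
)
e_log_joint = (
    -0.5 * d * np.log(2 * np.pi)
    - 0.5 * np.log(np.linalg.det(Sigma))
    - 0.5 * np.trace(Prec)
)
mismatch = e_log_prod_marginals - e_log_joint

assert np.isclose(kl_full, kl_marginals + mismatch)
```

With a unit-diagonal $\Sigma$ the marginal KLs are zero, so the entire variational gap in this example is the mean-field mismatch term discussed above.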
We expect this side effect whenever we use neural networks to learn a misspecified variational approximation.

3.3 Correcting the variational posterior

We aim to correct the variational posterior $q_\phi(Z \mid X, S)$ so that it satisfies specific independence statements of the form $\forall (i, j) \in \mathcal{S}, \ \hat{q}_\phi(z_i) \perp \hat{q}_\phi(z_j)$. As $\hat{q}_\phi(Z)$ is a mixture distribution, any standard measure of independence is intractable based on the conditionals $q_\phi(Z \mid X, S)$, even in the common case of a mixture of Gaussian distributions [3]. To address this issue, we propose a novel idea: estimate and minimize the dependency via a non-parametric statistical penalty. Given the AEVB framework, let $\lambda \in \mathbb{R}^+$, $\mathcal{Z}_0 = \{z_{i_1}, .., z_{i_p}\} \subset Z$ and $\mathcal{S}_0 = \{s_{j_1}, .., s_{j_q}\} \subset S$. The HCV framework with independence constraints on $\mathcal{Z}_0 \cup \mathcal{S}_0$ learns the parameters $\theta, \phi$ by maximizing the ELBO from AEVB penalized by

$- \lambda\, \mathrm{dHSIC}(\hat{q}_\phi(z_{i_1}, .., z_{i_p}), p_{\text{data}}(s_{j_1}, .., s_{j_q}))$.   (9)

A few comments are in order regarding this penalty. First, the dHSIC is positive and therefore our objective function is still a lower bound on the log-likelihood. The bound will be looser but the resulting parameters will yield a more suitable representation. This trade-off is adjustable via the parameter $\lambda$. Second, the dHSIC can be estimated with the same samples used for stochastic variational inference (i.e., sampling from the variational distribution) and for minibatch sampling (i.e., subsampling the dataset). Third, the HSIC penalty is based only on the variational parameters—not the parameters of the generative model.

4 Case study: Learning interpretable representations

Suppose we want to summarize the data $x$ with two independent components $u$ and $v$, as shown in Figure 1a. The task is especially important for data exploration since independent representations are often more easily interpreted.

A related problem is finding latent factors $(z_1, ..., z_d)$ that correspond to real and interpretable variations in the data. Learning independent representations is then a key step towards learning disentangled representations [6, 5, 23, 24]. The β-VAE [5] proposes further penalizing the $D_{KL}(q_\phi(z \mid x) \,\|\, p(z))$ term. It attains significant improvement over state-of-the-art methods on real datasets. However, this penalization has been shown to yield poor reconstruction performance [25]. The β-TCVAE [6] instead penalizes an approximation of the total correlation (TC), defined as $D_{KL}(\hat{q}_\phi(z) \,\|\, \prod_k \hat{q}_\phi(z_k))$ [26], which is a measure of multivariate mutual independence. However, this quantity does not have a closed-form expression [3], and the β-TCVAE uses a biased estimator of the TC—a lower bound from Jensen's inequality. That bias will be zero only if the estimator is evaluated on the whole dataset, which is not possible since the estimator has quadratic complexity in the number of samples. The bias of the HSIC [17], in contrast, is of order $O(1/n)$; it is negligible whenever the batch size is large enough. The HSIC therefore appears to be a more suitable method for enforcing independence in the latent space.

To assess the performance of these various approaches to finding independent representations, we consider a linear Gaussian system, for which exact posterior inference is tractable. Let $(n, m, d) \in \mathbb{N}^3$ and $\lambda \in \mathbb{R}^+$. Let $(A, B) \in \mathbb{R}^{d \times n} \times \mathbb{R}^{d \times m}$ be random matrices with iid normal entries. Let $\Sigma \in \mathbb{R}^{d \times d}$ be a random matrix following a Wishart distribution. Consider the following generative model:

$v \sim \mathrm{Normal}(0, I_n)$
$u \sim \mathrm{Normal}(0, I_m)$
$x \mid u, v \sim \mathrm{Normal}(Av + Bu, \lambda I_d + \Sigma)$.   (10)

The exact posterior $p(u, v \mid x)$ is tractable via block matrix inversion, as is the marginal $p(x)$, as shown in Appendix C.
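Before the experiments, it may help to see Equation 7 in code. The following is a minimal numpy sketch (our illustration, not the authors' implementation) of the biased HSIC V-statistic with Gaussian kernels; the fixed bandwidth `gamma` is an arbitrary choice here, whereas the experiments use the median heuristic:

```python
import numpy as np

def rbf_gram(x, gamma=1.0):
    # Gaussian kernel Gram matrix: k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def hsic_v(u, v, gamma=1.0):
    # Biased V-statistic of Equation 7; O(n^2) time and memory.
    n = len(u)
    K, L = rbf_gram(u, gamma), rbf_gram(v, gamma)
    term1 = (K * L).sum() / n**2
    term2 = K.sum() * L.sum() / n**4
    term3 = 2.0 * (K.sum(axis=1) @ L.sum(axis=1)) / n**3
    return term1 + term2 - term3

rng = np.random.default_rng(0)
u = rng.normal(size=(500, 2))
v_indep = rng.normal(size=(500, 2))           # independent of u
v_dep = u + 0.1 * rng.normal(size=(500, 2))   # strongly dependent on u

assert hsic_v(u, v_indep) >= 0.0
assert hsic_v(u, v_dep) > 5 * hsic_v(u, v_indep)
```

In Section 4 this quantity, computed between minibatch samples of $u$ and $v$, is the penalty added to the ELBO.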
We apply HCV with $Z = \{u, v\}$, $X = \{x\}$, $S = \emptyset$, $\mathcal{Z}_0 = \{u, v\}$, and $\mathcal{S}_0 = \emptyset$. This is equivalent to adding to the ELBO the penalty $-\lambda\, \mathrm{HSIC}(\hat{q}_\phi(u), \hat{q}_\phi(v))$, where $\hat{q}_\phi = \mathbb{E}_{p_{\text{data}}(x)} q_\phi(u, v \mid x)$. Appendix D describes the stochastic training procedure. We report the trade-off between correlation of the representation and the ELBO for various penalty weights $\lambda$ for each algorithm: β-VAE [5], β-TCVAE [6], an unconstrained VAE, and HCV. As correlation measures, we consider the summed Pearson correlation $\sum_{(i,j)} \rho(\hat{q}_\phi(u_i), \hat{q}_\phi(v_j))$ and the HSIC.

Results are reported in Figure 2. The VAE baseline (like all the other methods) has an ELBO value worse than the marginal log-likelihood (horizontal bar), since the real posterior is not likely to be in the function class given by naive mean field AEVB. This baseline also has greater dependence in the aggregated posterior $\hat{q}_\phi(u, v)$ than in the exact posterior $p(u, v)$ (vertical bar) for the two measures of correlation. While correcting the variational posterior, we want the best trade-off between model fit and independence: HCV attains the highest ELBO values despite having the lowest correlation.

Figure 2: Results for the linear Gaussian system. All results are for a test set. Each dot is averaged across five random seeds. Larger dots indicate greater regularization. The purple line is the log-likelihood under the true posterior. The cyan line is the correlation under the true posterior.

5 Case study: Learning invariant representations

We now consider the problem of learning a representation of the data that is invariant to a given nuisance variable. As a particular instance of the graphical model in Figure 1b, we embed an image $x$ into a latent vector $z_1$ whose distribution is independent of the observed lighting condition $s$ while being predictive of the person identity $y$ (Figure 3).
The generative model is defined in Figure 3c, and the variational distribution decomposes as $q_\phi(z_1, z_2 \mid x, s, y) = q_\phi(z_1 \mid x, s) q_\phi(z_2 \mid z_1, y)$, as in [7].

Figure 3: Framework for learning invariant representations in the Extended Yale B Face dataset. (a) $s$: angle between the camera and the light source. (b) One image $x$ for a given lighting condition $s$ and person $y$. (c) Complete graphical model.

This problem has been studied in [7] for binary or categorical $s$. For their experiment with a continuous covariate $s$, they discretize $s$ and use the MMD to match the distributions $\hat{q}_\phi(z_1 \mid s = 0)$ and $\hat{q}_\phi(z_1 \mid s = j)$ for all $j$. Perhaps surprisingly, their penalty turns out to be a special case of our HSIC penalty. (We present a proof of this fact in Appendix D.)

Proposition 2. Let the nuisance factor $s$ be a discrete random variable and let $l$ (the kernel for $\mathcal{K}$) be a Kronecker delta function $\delta: (s, s') \mapsto 1_{s = s'}$. Then, the V-statistic corresponding to $\mathrm{HSIC}(\hat{q}_\phi(z_1), p_{\text{data}}(s))$ is a weighted sum of the V-statistics of the MMD between the pairs $\hat{q}_\phi(z \mid s = i), \hat{q}_\phi(z \mid s = j)$. The weights are functions of the empirical probabilities for $s$.

Working with the HSIC rather than an MMD penalty lets us avoid discretizing $s$. We take into account the whole angular range and not simply the direction of the light. We apply HCV with mean-field AEVB, $Z = \{z_1, z_2\}$, $X = \{x, y\}$, $S = \{s\}$, $\mathcal{Z}_0 = \{z_1\}$ and $\mathcal{S}_0 = \{s\}$.

Dataset The extended Yale B dataset [27] contains cropped faces [28] of 38 people under 50 lighting conditions. These conditions are unit vectors in $\mathbb{R}^3$ encoding the direction of the light source and can be summarized into five discrete groups (upper right, upper left, lower right, lower left and front). Following [7], we use one image from each group per person (190 images in total) and use the remaining images for testing.
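Proposition 2 can be probed numerically. The sketch below (ours, with an arbitrary bandwidth and group assignment) computes the biased MMD V-statistic between group-conditional samples, i.e., the VFAE-style penalty, and the HSIC V-statistic with a Kronecker delta kernel on the discrete nuisance. Since the latter is a weighted combination of the former, both should be near zero for an invariant representation and both should flag a dependent one:

```python
import numpy as np

def rbf_gram(a, b, gamma=1.0):
    # Gaussian kernel Gram matrix between two sample sets.
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def mmd_v(a, b, gamma=1.0):
    # Biased V-statistic for MMD^2; equals the squared distance between
    # empirical kernel mean embeddings, hence it is nonnegative.
    return (rbf_gram(a, a, gamma).mean() + rbf_gram(b, b, gamma).mean()
            - 2.0 * rbf_gram(a, b, gamma).mean())

def hsic_delta(z, s, gamma=1.0):
    # HSIC V-statistic with a Gaussian kernel on z and a Kronecker delta
    # kernel on the discrete nuisance s (the setting of Proposition 2).
    n = len(z)
    K = rbf_gram(z, z, gamma)
    L = (s[:, None] == s[None, :]).astype(float)
    return ((K * L).sum() / n**2 + K.sum() * L.sum() / n**4
            - 2.0 * (K.sum(axis=1) @ L.sum(axis=1)) / n**3)

rng = np.random.default_rng(1)
s = rng.integers(0, 2, size=400)           # two lighting groups
z_indep = rng.normal(size=(400, 2))        # representation ignores s
z_dep = z_indep + 2.0 * s[:, None]         # representation shifts with s

assert mmd_v(z_dep[s == 0], z_dep[s == 1]) > mmd_v(z_indep[s == 0], z_indep[s == 1])
assert hsic_delta(z_dep, s) > 5 * hsic_delta(z_indep, s)
```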
The task is to learn a representation of the faces that is good at identifying people but has low correlation with the lighting conditions.

Experiment We repeat the experiments from the paper introducing the variational fair auto-encoder (VFAE) [7], this time comparing the VAE [1] with no covariate $s$, the VFAE [7] with observed lighting direction groups (five groups), and HCV with the observed lighting direction vector (a three-dimensional vector). As a supplemental baseline, we also report results for the unconstrained VAEs. As in [7], we report 1) the accuracy for classifying the person based on the variational distribution $q_\phi(y \mid z_1, s)$; 2) the classification accuracy for the lighting group condition (five-way classification) based on a logistic regression and a random forest classifier trained on a sample from the variational posterior $q_\phi(z_1 \mid x, s)$ for each datapoint; and 3) the average error for predicting the lighting direction with a linear regression and a random forest regressor, trained on a sample from the variational posterior $q_\phi(z_1 \mid x, s)$. Error is expressed in degrees. $\lambda$ is optimized via grid search as in [7].

We report our results in Table 1. As expected, adding information (either the lighting group or the refined lighting direction) always improves the quality of the classifier $q_\phi(y \mid z_1, s)$. This can be seen by comparing the scores between the vanilla VAE and the unconstrained algorithms. However, by using side information $s$, the unconstrained models yield a less suitable representation because it is more correlated with the nuisance variables. There is therefore a trade-off between correlation with the nuisance and performance.
Our proposed method (HCV) shows greater invariance to lighting direction while accurately predicting people's identities.

| Model | Person identity (accuracy) | Lighting group: logistic regression (avg. classification error) | Lighting group: random forest classifier (avg. classification error) | Lighting direction: linear regression (avg. error, degrees) | Lighting direction: random forest regressor (avg. error, degrees) |
|---|---|---|---|---|---|
| VAE | 0.72 | 0.26 | 0.11 | 14.07 | 9.40 |
| VFAE* | 0.74 | 0.23 | 0.01 | 13.96 | 8.63 |
| VFAE | 0.69 | 0.51 | 0.42 | 23.59 | 19.89 |
| HCV* | 0.75 | 0.25 | 0.10 | 12.25 | 2.59 |
| HCV | 0.75 | 0.52 | 0.29 | 36.15 | 28.04 |

Table 1: Results on the Extended Yale B dataset. Preprocessing differences likely explain the slight deviation in scores from [7]. Stars (*) indicate that the unconstrained version of the algorithm was used.

6 Case study: Learning denoised representations

This section presents a case study of denoising datasets in the setting of an important open scientific problem. The task of denoising consists of representing experimental observations $x$ and nuisance observations $s$ with two independent signals: biological signal $z$ and technical noise $u$. The difficulty is that $x$ contains both biological signal and noise and is therefore strongly correlated with $s$ (Figure 1c). In particular, we focus on single-cell RNA sequencing (scRNA-seq) data, which renders a gene-expression snapshot of a heterogeneous sample of cells. Such data can reveal a cell's type [29, 30], if we can cope with a high level of technical noise [31].

The output of an scRNA-seq experiment is a list of transcripts $(l_m)_{m \in \mathcal{M}}$. Each transcript $l_m$ is an mRNA molecule enriched with a cell-specific barcode and a unique molecule identifier, as in [32]. Cell-specific barcodes enable the biologist to work at single-cell resolution.
Unique molecule identifiers (UMIs) are meant to remove a significant part of the technical bias (e.g., amplification bias) and make it possible to obtain an accurate probabilistic model for these datasets [33]. Transcripts are then aligned to a reference genome with tools such as CellRanger [34].

The data from the experiment has two parts. First, there is a gene expression matrix $(X_{ng})_{(n, g) \in \mathcal{N} \times \mathcal{G}}$, where $\mathcal{N}$ designates the set of cells detected in the experiment and $\mathcal{G}$ is the set of genes to which the transcripts have been aligned. A particular entry of this matrix indicates the number of times a particular gene has been expressed in a particular cell. Second, we have quality control metrics $(s_i)_{i \in \mathcal{S}}$ (described in Appendix E), which assess the level of errors and corrections in the alignment process. These metrics cannot be described with a generative model as easily as gene expression data, but they nonetheless impact a significant number of tasks in the research area [35]. Another significant portion of these metrics focuses on sampling effects (i.e., the discrepancy in the total number of transcripts captured in each cell), which can be taken into account in a principled way in a graphical model, as in [33].

We visualize these datasets $x$ and $s$ with tSNE [36] in Figure 4. Note that $x$ is correlated with $s$, especially within each cell type. A common application of scRNA-seq is discovering cell types, which can be done without correcting for the alignment errors [37]. A second important application is identifying genes that are more expressed in one cell type than in another—this hypothesis testing problem is called differential expression [38, 39].
Not modeling $s$ can induce a dependence on $x$ which hampers hypothesis testing [35]. Most research efforts in scRNA-seq methodology focus on using generalized linear models and two-way ANOVA [40, 35] to regress out the effects of the quality control metrics. However, this paradigm is incompatible with hypothesis testing. A generative approach, by contrast, allows marginalizing out the effect of these metrics, which is more aligned with Bayesian principles. Our main contribution is to incorporate these alignment errors into our graphical model to provide a better Bayesian testing procedure. We apply HCV with $Z = \{z, u\}$, $X = \{x, s\}$, $\mathcal{Z}_0 = \{z, u\}$. By integrating out $u$ while sampling from the variational posterior, $\int q_\phi(x \mid z, u)\, dp(u)$, we find a Bayes factor that is not subject to noise. (See Appendix F for a complete presentation of the hypothesis testing framework and the graphical model under consideration.)

Figure 4: Raw data from the PBMC dataset. $s_1$ is the proportion of transcripts which confidently mapped to a gene for each cell. (a) Embedding of $x$: gene expression data; each point is a cell, colored by cell type. (b) Embedding of $s$: alignment errors; each point is a cell, colored by $s_1$. (c) Embedding of $x$: gene expression data; each point is a cell, colored by the same quality control metric $s_1$.

Dataset We considered scRNA-seq data from peripheral blood mononuclear cells (PBMCs) from a healthy donor [34]. Our dataset includes 12,039 cells and 3,346 genes, five quality control metrics from CellRanger, and cell-type annotations extracted with Seurat [41]. We preprocessed the data as in [33, 35].
Our ground truth for the hypothesis testing, from microarray studies, is a set of genes that are differentially expressed between human B cells and dendritic cells ($n = 10$ in each group [42]).

Experiment We compare scVI [33], a state-of-the-art model with no observed nuisance variables (eight latent dimensions for $z$), against our proposed model with observed quality control metrics. We use five latent dimensions for $z$ and three for $u$. The penalty $\lambda$ is selected through grid search. For each algorithm, we report 1) the coefficient of determination of a linear regression and a random forest regressor predicting the quality control metrics from the latent space, and 2) the irreproducible discovery rate (IDR) [43] model between the Bayes factors of the model and the p-values from the microarray. The mixture weights, reported in [33], are similar between the original scVI and our modification (and therefore higher than other mainstream differential expression procedures) and saturate the number of significant genes in this experiment (~23%). We also report the correlation of the reproducible mixture as a second-order quality metric for our gene rankings.

We report our results in Table 2. First, the proposed method efficiently removes much of the correlation with the nuisance variables $s$ in the latent space $z$. Second, the proposed method yields a better ranking of the genes when performing Bayesian hypothesis testing. This is shown by a substantially higher correlation coefficient for the IDR, which indicates that the obtained ranking better conforms with the microarray results.
Our denoised latent space is therefore extracting information from the data that is less subject to alignment errors and more biologically interpretable.

           Irreproducible Discovery Rate            Quality control metrics (coefficient of determination)
           Mixture weight    Reproducible corr.     Linear Regression    Random Forest Regression
  scVI     0.213 ± 0.001     0.26 ± 0.07            0.195                0.129
  HCV      0.217 ± 0.003     0.43 ± 0.02            0.176                0.123

Table 2: Results on the PBMCs dataset. IDR results are averaged over twenty initializations.

7 Discussion

We have presented a flexible framework for correcting independence properties of aggregated variational posteriors learned via naive mean field AEVB. The correction is performed by penalizing the ELBO with the HSIC, a kernel-based measure of dependency, between samples from the variational posterior.

We illustrated how variational posterior misspecification in AEVB can inadvertently promote dependence in the aggregated posterior. Future work should look at other variational approximations and quantify this dependence.

Penalizing the HSIC as we do for each mini-batch implies that no information is learned about the distributions q̂(Z) or ∏_i q̂(z_i) during training. On the one hand, this is positive since we do not have to estimate more parameters, especially if the joint estimation would imply a minimax problem as in [23, 13]. On the other hand, this could be harmful if the HSIC could not be estimated from a single mini-batch. Our experiments show this does not happen in a reasonable set of configurations.

Trading a minimax problem for an estimation problem does not come for free. First, there are some computational considerations. The HSIC is computed in quadratic time, but linear-time estimators of dependence [44] or random-feature approximations [45] should be used for non-standard batch sizes.
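For concreteness, the biased quadratic-time HSIC estimator with a Gaussian kernel and a median-heuristic bandwidth can be sketched as follows; the data here are synthetic, and this is a minimal illustration of the statistic, not the training code:

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_gram(x):
    """Gaussian kernel Gram matrix; bandwidth set by one common variant of
    the median heuristic (median of the nonzero squared distances)."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    bw = np.median(d2[d2 > 0])
    return np.exp(-d2 / bw)

def hsic(x, y):
    """Biased quadratic-time HSIC estimator (Gretton et al., 2005 [4])."""
    n = len(x)
    K, L = gaussian_gram(x), gaussian_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# HSIC is large for dependent pairs, near zero for independent ones.
n = 200
x = rng.standard_normal((n, 1))
y_dep = x + 0.1 * rng.standard_normal((n, 1))   # strongly dependent on x
y_ind = rng.standard_normal((n, 1))             # independent of x
```

Each Gram matrix costs O(n²) memory and time, which is why mini-batch evaluation is practical while full-dataset evaluation is not.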
For example, to train on the entire extended Yale B dataset, VAE takes two minutes, VFAE takes ten minutes², and HCV takes three minutes. Second, the problem of choosing the best kernel is known to be difficult [46]. In the experiments, we rely on a standard and efficient choice: a Gaussian kernel with the median heuristic for the bandwidth. The bandwidth can be chosen analytically in the case of a Gaussian latent variable, and computed offline in the case of an observed nuisance variable. Third, the general formulation of HCV with the dHSIC penalization, as in Equation 9, should be nuanced since the V-statistic relies on a U-statistic of order 2d. Standard non-asymptotic bounds as in [4] would exhibit a concentration rate of O(√(d/n)) and therefore not scale well for a large number of variables.

We also applied our HCV framework to scRNA-seq data to remove technical noise. The same graphical model can be readily applied to several other problems in the field. For example, we may wish to remove cell-cycle effects [47], which are biologically variable but typically independent of what biologists want to observe. We hope our approach will empower biological analysis with scalable and flexible tools for data interpretation.

Acknowledgments

NY and RL were supported by grant U19 AI090023 from NIH-NIAID.

References

[1] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.

[2] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 693–700, 2010.

[3] Jean-Louis Durrieu, Jean-Philippe Thiran, and Finnian Kelly. Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models.
In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4833–4836, 2012.

² VFAE is slower because of the discrete operations it has to perform to form the samples for estimating the MMD.

[4] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–77, 2005.

[5] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

[6] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In International Conference on Learning Representations: Workshop Track, 2018.

[7] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The Variational Fair Autoencoder. In International Conference on Learning Representations, 2016.

[8] Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R. Bach. Supervised dictionary learning. In Advances in Neural Information Processing Systems, pages 1033–1040, 2009.

[9] Andrew Gelman and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2007.

[10] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.

[11] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow.
In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

[12] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[13] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial Autoencoders. In International Conference on Learning Representations: Workshop Track, 2016.

[14] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Advances in Approximate Bayesian Inference, NIPS Workshop, 2016.

[15] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, pages 489–496, 2008.

[16] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13:723–773, 2012.

[17] Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexander J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pages 585–592, 2008.

[18] Niklas Pfister, Peter Bühlmann, Bernhard Schölkopf, and Jonas Peters. Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):5–31, 2017.

[19] Zoltán Szabó and Bharath K. Sriperumbudur. Characteristic and universal tensor product kernels. Journal of Machine Learning Research, 18(233):1–29, 2018.

[20] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.

[21] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov.
Importance weighted autoencoders. In International Conference on Learning Representations, 2016.

[22] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 882–891, 2018.

[23] Hyunjik Kim and Andriy Mnih. Disentangling by Factorising. In Learning Disentangled Representations: NIPS Workshop, 2017.

[24] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[25] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. In Learning Disentangled Representations, NIPS Workshop, 2017.

[26] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.

[27] Athinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.

[28] Kuang-Chih Lee, Jeffrey Ho, and David J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684–698, 2005.

[29] Allon Wagner, Aviv Regev, and Nir Yosef. Revealing the vectors of cellular identity with single-cell genomics. Nature Biotechnology, 34(11):1145–1160, 2016.

[30] Amos Tanay and Aviv Regev. Scaling single-cell genomics from phenomenology to mechanism.
Nature,\n\n541:331\u2013338, 2017.\n\n[31] Dominic Grun, Lennart Kester, and Alexander van Oudenaarden. Validation of noise models for single-cell\n\ntranscriptomics. Nature Methods, 11(6):637\u2013640, 2014.\n\n[32] Allon M Klein, Linas Mazutis, Ilke Akartuna, Naren Tallapragada, Adrian Veres, Victor Li, Leonid\nPeshkin, David A Weitz, and Marc W Kirschner. Droplet barcoding for single-cell transcriptomics applied\nto embryonic stem cells. Cell, 161(5):1187\u20131201, 2015.\n\n[33] Romain Lopez, Jeffrey Regier, Michael B. Cole, Michael I. Jordan, and Nir Yosef. Bayesian Inference for\n\na Generative Model of Transcriptome Pro\ufb01les from Single-cell RNA Sequencing. bioRxiv, 2018.\n\n[34] Grace X.Y. Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson,\nSolongo B Ziraldo, Tobias D Wheeler, Geoff P. McDermott, Junjie Zhu, Mark T Gregory, Joe Shuga, Luz\nMontesclaros, Jason G Underwood, Donald A Masquelier, Stefanie Y Nishimura, Michael Schnall-Levin,\nPaul W Wyatt, Christopher M. Hindson, Rajiv Bharadwaj, Alexander Wong, Kevin D Ness, Lan W Beppu,\nH Joachim Deeg, Christopher McFarland, Keith R Loeb, William J Valente, Nolan G Ericson, Emily A\nStevens, Jerald P Radich, Tarjei S Mikkelsen, Benjamin J Hindson, and Jason H Bielas. Massively parallel\ndigital transcriptional pro\ufb01ling of single cells. Nature Communications, 8, 2017.\n\n[35] Michael B Cole, Davide Risso, Allon Wagner, David DeTomaso, John Ngai, Elizabeth Purdom, Sandrine\nDudoit, and Nir Yosef. Performance Assessment and Selection of Normalization Procedures for Single-Cell\nRNA-Seq. bioRxiv, 2017.\n\n[36] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning\n\nResearch, 9:2579\u20132605, 2008.\n\n[37] Bo Wang, Junjie Zhu, Emma Pierson, Daniele Ramazzotti, and Sera\ufb01m Batzoglou. Visualization and\nanalysis of single-cell RNA-seq data by kernel-based similarity learning. 
Nature Methods, 14(4):414–416, 2017.

[38] Michael I. Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, 2014.

[39] Greg Finak, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K Shalek, Chloe K Slichter, Hannah W Miller, M Juliana McElrath, Martin Prlic, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 16(1):278, 2015.

[40] Davide Risso, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. A general and flexible method for signal extraction from single-cell RNA-seq data. Nature Communications, 9(1):284, 2018.

[41] Evan Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison Bialas, Nolan Kamitaki, Emily Martersteck, John Trombetta, David Weitz, Joshua Sanes, Alex Shalek, Aviv Regev, and Steven McCarroll. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2017.

[42] Helder I Nakaya, Jens Wrammert, Eva K Lee, Luigi Racioppi, Stephanie Marie-Kunze, W Nicholas Haining, Anthony R Means, Sudhir P Kasturi, Nooruddin Khan, Gui Mei Li, Megan McCausland, Vibhu Kanchan, Kenneth E Kokko, Shuzhao Li, Rivka Elbein, Aneesh K Mehta, Alan Aderem, Kanta Subbarao, Rafi Ahmed, and Bali Pulendran. Systems biology of vaccination for seasonal influenza in humans. Nature Immunology, 12(8):786–795, 2011.

[43] Qunhua Li, James B Brown, Haiyan Huang, and Peter J Bickel. Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, 5(3):1752–1779, 2011.

[44] Wittawat Jitkrittum, Zoltán Szabó, and Arthur Gretton. An adaptive test of independence with analytic kernel embeddings.
In Proceedings of the 34th International Conference on Machine Learning, pages 1742–1751, 2017.

[45] Adrián Pérez-Suay and Gustau Camps-Valls. Sensitivity maps of the Hilbert–Schmidt independence criterion. Applied Soft Computing Journal, 2018.

[46] Seth Flaxman, Dino Sejdinovic, John P Cunningham, and Sarah Filippi. Bayesian learning of kernel embeddings. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016.

[47] Florian Buettner, Kedar N Natarajan, F Paolo Casale, Valentina Proserpio, Antonio Scialdone, Fabian J Theis, Sarah A Teichmann, John C Marioni, and Oliver Stegle. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology, 33(2):155–160, 2015.