{"title": "Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data", "book": "Advances in Neural Information Processing Systems", "page_first": 9115, "page_last": 9124, "abstract": "Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying complex disease, such as cancer, is often limited, especially considering disease heterogeneity. The key question is whether we may integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping even if the number of samples for a specific cancer type is small. 
Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without ``negative transfer'' effects often seen in existing multi-task learning and transfer learning methods.", "full_text": "Bayesian multi-domain learning for cancer subtype\ndiscovery from next-generation sequencing count data\n\nEhsan Hajiramezanali\nTexas A&M University\n\nehsanr@tamu.edu\n\nSiamak Zamani Dadaneh\n\nTexas A&M University\n\nsiamak@tamu.edu\n\nMingyuan Zhou\n\nUniversity of Texas at Austin\n\nMingyuan.Zhou@mccombs.utexas.edu\n\nAlireza Karbalayghareh\nTexas A&M University\nalireza.kg@tamu.edu\n\nXiaoning Qian\n\nTexas A&M University\nxqian@ece.tamu.edu\n\nAbstract\n\nPrecision medicine aims for personalized prognosis and therapeutics by utilizing re-\ncent genome-scale high-throughput pro\ufb01ling techniques, including next-generation\nsequencing (NGS). However, translating NGS data faces several challenges. First,\nNGS count data are often overdispersed, requiring appropriate modeling. Second,\ncompared to the number of involved molecules and system complexity, the number\nof available samples for studying complex disease, such as cancer, is often lim-\nited, especially considering disease heterogeneity. The key question is whether\nwe may integrate available data from all different sources or domains to achieve\nreproducible disease prognosis based on NGS count data. In this paper, we develop\na Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent\nlatent representations of overdispersed count data based on hierarchical negative\nbinomial factorization for accurate cancer subtyping even if the number of samples\nfor a speci\ufb01c cancer type is small. 
Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without "negative transfer" effects often seen in existing multi-task learning and transfer learning methods.

1 Introduction

In this paper, we study Bayesian Multi-Domain Learning (BMDL) for analyzing count data from next-generation sequencing (NGS) experiments, with the goal of enhancing cancer subtyping in a target domain that has a limited number of NGS samples by leveraging surrogate data from other domains, for example relevant data from other well-studied cancer types. Due to both biological and technical limitations, it is often difficult and costly, if not prohibitive, to collect enough samples when studying complex diseases, especially considering the complexity of disease processes. When studying one cancer type, there are typically at most hundreds of samples available while tens of thousands of genes/molecules are involved, even in the case of the arguably largest cancer consortium, The Cancer Genome Atlas (TCGA) [The Cancer Genome Atlas Research Network et al., 2008]. Considering the heterogeneity in cancer and the potential cost of clinical studies and profiling, we usually have fewer than one hundred samples, which often does not lead to generalizable results. Our goal here is to develop effective ways to derive predictive feature representations using available NGS data from different sources to enable accurate and reproducible cancer subtyping.

The assumption of having only one domain is restrictive in many practical scenarios due to the nonstationarity of the underlying system and data heterogeneity.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Multi-task learning (MTL), transfer learning (TL), and domain adaptation (DA) techniques have recently been utilized to leverage the relevant data and knowledge of different domains to improve the predictive power in all domains or in one target domain [Pan and Yang, 2010, Patel et al., 2015]. In MTL, there are D different labeled domains with related data, and the goal is to improve the predictive power of all domains together. In TL, there are D - 1 source domains and one target domain such that we have plenty of labeled data in the source domains and only a few labeled data in the target domain, and the goal is to take advantage of the source data, for example by domain adaptation, to improve the predictive power in the target domain. Although many TL and MTL methods have been proposed, "negative transfer" may happen, with degraded performance, when the domains are not related but the methods nevertheless force the "transfer" of data and model knowledge. Due to the discriminative nature of these methods, a rigorous theoretical understanding of when data from different domains can help each other is still lacking.

In this paper, instead of following most TL/MTL methods in relying on discriminative models $p(y|\theta, \vec{n})$ given high-dimensional count data $\vec{n}$, we propose a generative framework to learn more flexible latent representations of $\vec{n}$ from different domains. We first construct a Bayesian hierarchical model $p(\vec{n})$, which is essentially a factorization model for the counts $\vec{n}$, to derive domain-dependent latent representations allowing both domain-specific and globally shared latent factors. The learned low-dimensional representations can then be used together with any supervised or unsupervised predictive models for cancer subtyping.
Due to its unsupervised nature when deriving latent representations, we term our model Bayesian Multi-Domain Learning (BMDL). This is desirable in cancer subtyping since we may not always have labeled data, and thus the model flexibility of BMDL enables effective transfer learning across different domains, with or without labeled data.

By assigning the inferred latent factors to each domain independently, based on how much each latent factor contributes to that domain, BMDL can automatically learn the sample relevance across domains from the number of shared latent factors in a data-driven manner. On the other hand, the domain-specific latent factors help keep important information in each domain without severe information loss in the derived domain-dependent latent representations of the original count data. Therefore, BMDL automatically avoids the "negative transfer" that many TL/MTL methods struggle with. At the same time, the number of shared latent factors can serve as one possible measure of domain relevance that may lead to more rigorous theoretical study of TL/MTL methods.

Specifically, for BMDL, we propose a novel multi-domain negative binomial (NB) factorization model for over-dispersed NGS count data. Similar to Dadaneh et al. [2018] and Hajiramezanali et al. [2018], we employ NB distributions for count data to obviate the need for the multiple ad-hoc pre-processing steps required in most gene expression analyses.
More precisely, BMDL identifies domain-specific and globally shared latent factors across different sequencing experiments treated as domains, corresponding, for example, to gene modules significant for subtyping different cancer types, and then uses them to improve subtyping performance in a target domain with a very small number of samples. We introduce latent binary "selector" variables that assign the factors to different domains. Inspired by the Indian Buffet Process (IBP) [Ghahramani and Griffiths, 2006], we impose beta-Bernoulli priors over them, leading to sparse domain-dependent latent factor representations. By exploiting a novel data augmentation technique for the NB distribution [Zhou and Carin, 2015], an efficient Gibbs sampling algorithm with closed-form updates is derived for BMDL. Our experiments on both synthetic and real-world RNA-seq datasets verify the benefits of our model in improving predictive power in domains with small training sets by borrowing information from domains with rich training data. In particular, we demonstrate a substantial increase in cancer subtyping accuracy by leveraging related RNA-seq datasets, and also show that in scenarios with unrelated datasets, our method does not create adverse effects.

2 Related work

TL/MTL methods typically assume some notion of relevance across the domains of the corresponding tasks: all tasks under study either possess a cluster structure [Xue et al., 2007, Jacob et al., 2009, Kang et al., 2011], share feature representations in common low-dimensional subspaces [Argyriou et al., 2007, Rai and Daume III, 2010], or have parameters drawn from shared prior distributions [Chelba and Acero, 2006]. Most of these methods force the corresponding assumptions for MTL in order to link the data across domains. However, when tasks are not related and the corresponding data come from different underlying distributions, forcing MTL may lead to degraded performance.
To address this problem, Passos et al. [2012] proposed a Bayesian nonparametric MTL model that represents the task parameters as a mixture of latent factors. However, this model requires the number of both latent factors and mixtures to be less than the number of domains. This may lead to information loss, and the model has only shown an advantage when the number of domains is high. In real-world applications, for example when analyzing cancer data, we may only have a small number of domains. Kumar and Daume III [2012] assumed that the task parameters within a group of related tasks lie in a low-dimensional subspace and allowed the tasks in different groups to overlap with each other in one or more bases. But this model requires a large number of training samples across domains.

The hierarchical Dirichlet process (HDP) [Teh et al., 2005] has been proposed to borrow statistical strength across multiple groups by sharing mixture components. Although HDP targets a general family of distributions, special application-specific efforts are needed to make it suitable for modeling count data. To directly model the counts assigned to mixture components as NB random variables, Zhou and Carin [2015] performed joint count and mixture modeling via the NB process. Building on the NB process and integrating it with HDP [Teh et al., 2005], NB-HDP employs a Dirichlet process (DP) to model the rate measure of a Poisson process. However, NB-HDP is constructed by fixing the probability parameter of the NB distribution. While fixing this parameter is a natural choice in mixture modeling, where it appears irrelevant after normalization, it makes the restrictive assumption that each count vector has the same variance-to-mean ratio. This is not appropriate for the NGS count modeling in this paper.
Closely related to multinomial mixed-membership models, Zhou [2018] proposed the hierarchical gamma-negative binomial process (hGNBP) to support countably infinite factors for negative binomial factor analysis (NBFA), where each of the J samples is assigned a sample-specific gamma-negative binomial process and a globally shared gamma process is mixed with all J of these processes. Our BMDL also uses the hGNBP to model the counts in each domain, but imposes a spike-and-slab model to ensure that domain-specific latent factors can be identified.

In this paper, we propose a hierarchical Bayesian model, BMDL, for multi-domain learning by deriving domain-dependent latent representations of observed data across domains. By jointly deriving latent representations with both domain-specific and shared latent factors, we take the best advantage of shared information across domains for effective multi-domain learning. In the context of cancer subtyping, we are interested in deriving such meaningful representations for accurate and reproducible subtyping in the target domain, where only a limited number of samples are available. We will first show in our experiments that when the source and target data share more latent factors, we can better help subtyping in the target domain with higher accuracy; more importantly, we will also show that even when the domains are distantly related, our method can selectively integrate information from other domain(s) to improve subtyping in the target domain while prohibiting the use of irrelevant knowledge to avoid performance degradation.

3 Method

We would like to model the observed counts $n^{(d)}_{vj}$ from next-generation sequencing (NGS) for gene $v \in \{1, \ldots, V\}$ in sample $j \in \{1, \ldots, J_d\}$ of domain $d \in \{1, \ldots, D\}$ to help cancer subtyping.
The main modeling challenges here include: (1) NGS counts are often over-dispersed, requiring ad-hoc pre-processing that may lead to biased results; (2) there is a much smaller number of samples relative to the number of genes ($V \gg J$), especially in the target domain of interest; and (3) it is often unknown how relevant/similar the samples across different domains are, so forcing joint learning may lead to degraded performance.

We construct a Bayesian Multi-Domain Learning (BMDL) framework based on a domain-dependent latent negative binomial (NB) factor model for NGS counts so that (1) over-dispersion is appropriately modeled and ad-hoc pre-processing is not needed; (2) low-dimensional representations of counts in different domains can help achieve more robust subtyping results; and, most importantly, (3) the sample relevance across domains can be explicitly learned to guarantee the effectiveness of joint learning across multiple domains.

BMDL achieves flexible multi-domain learning by first constructing an NB factorization model of NGS counts, and then explicitly establishing the relevance of samples across different domains by introducing domain-dependent binary variables that assign latent factors to each domain. The graphical representation of BMDL is illustrated in Fig. 1.

Figure 1: BMDL based on the multi-domain negative binomial factorization model. [The graphical model has global variables $\gamma_0$, $c_0$, $s_k$, $\pi_k$, $\phi_k$; domain-level variables $z_{kd}$, $r^{(d)}_k$, $c_d$; and sample-level variables $c^{(d)}_j$, $\theta^{(d)}_{kj}$, $p^{(d)}_j$, $n^{(d)}_{vj}$, with plates over $j = 1, \ldots, J_d$ and $d = 1, \ldots, D$.]

We model NGS counts $n^{(d)}_{vj}$ with the following representation:
$$n^{(d)}_{vj} = \sum_{k=1}^{K} n^{(d)}_{vjk}, \qquad n^{(d)}_{vjk} \sim \mathrm{NB}\left(\phi_{vk}\theta^{(d)}_{kj},\, p^{(d)}_j\right), \qquad (1)$$
where $n^{(d)}_{vj}$ is factorized into $K$ sub-counts $n^{(d)}_{vjk}$, each of which is a latent factor distributed according to an NB distribution.
The factor loading parameter $\phi_{vk}$ quantifies the association between gene $v$ and latent factor $k$, while the score parameter $\theta^{(d)}_{kj}$ captures the popularity of factor $k$ in sample $j$ of domain $d$. Note that the factor loadings are shared across all domains, which makes their inference more robust when the number of samples is low, especially in the target domain. This does not restrict the model's flexibility in capturing inter-domain variability, as the score parameters determine the significance of the corresponding latent factors across domains. The score parameter $\theta^{(d)}_{kj}$ is assumed to follow a gamma distribution:
$$\theta^{(d)}_{kj} \sim \mathrm{Gamma}\left(r^{(d)}_k,\, 1/c^{(d)}_j\right), \qquad (2)$$
with the scale parameter $c^{(d)}_j$ modeling the variability of sample $j$ of domain $d$ and the shape parameter $r^{(d)}_k$ capturing the popularity of factor $k$ in domain $d$. To further enable domain-dependent latent representations, we introduce another hierarchical layer on the shape parameter:
$$r^{(d)}_k \sim \mathrm{Gamma}\left(s_k z_{kd},\, 1/c_d\right), \qquad (3)$$
where the binary latent variables $z_{kd}$ act as domain-dependent selector variables that allow different latent representations, with the corresponding $r^{(d)}_k$ being present or absent across domains: when $z_{kd} = 1$, latent factor $k$ is present in the factorization of counts in domain $d$; it is absent otherwise. In our multi-domain learning framework, as the sample relevance across domains can vary significantly, this layer provides the additional flexibility to model the sample relevance in the given data across domains.
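As an illustration, the generative hierarchy in (1)-(3) can be sketched in Python. All dimensions and hyperparameters below are arbitrary illustrative choices, not the paper's settings, and the NB draws use the standard Poisson-gamma mixture representation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's settings)
V, K, D = 50, 8, 2          # genes, latent factors, domains
J = [5, 5]                  # samples per domain

# Binary selectors z_kd from Eq. (3); here drawn from a beta-Bernoulli prior
c = 1.0
pi = rng.beta(c / K, c * (1.0 - 1.0 / K), size=K)
z = rng.binomial(1, pi, size=(D, K))            # z[d, k] = 1 -> factor k active in domain d

s = rng.gamma(1.0 / K, 1.0, size=K)             # global factor popularity s_k
phi = rng.dirichlet(np.full(V, 0.1), size=K).T  # shared loadings phi[v, k]

eps = 1e-12                                     # guard: gamma shape must stay positive
counts = []
for d in range(D):
    # Eq. (3): domain-dependent factor weights; z_kd = 0 switches a factor off
    r = rng.gamma(s * z[d] + eps, 1.0)          # scale 1/c_d with c_d = 1
    X = np.zeros((V, J[d]), dtype=np.int64)
    for j in range(J[d]):
        p_j = rng.beta(2.0, 2.0)                # p_j^(d): sequencing-depth parameter
        theta = rng.gamma(r + eps, 1.0)         # Eq. (2): factor scores, c_j^(d) = 1
        # Eq. (1): NB(m, p) drawn as lambda ~ Gamma(m, p/(1-p)), n ~ Poisson(lambda)
        lam = rng.gamma(phi @ theta + eps, p_j / (1.0 - p_j))
        X[:, j] = rng.poisson(lam)
    counts.append(X)

print(counts[0].shape, counts[1].shape)  # (50, 5) (50, 5)
```

Inactive factors ($z_{kd} = 0$) contribute essentially nothing to the counts of domain $d$, which is the mechanism BMDL uses to separate domain-specific from shared structure.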
In (3), $s_k$ is the global popularity of factor $k$ across all domains. Inspired by the beta-Bernoulli process [Thibaux and Jordan, 2007], whose marginal representation is also known as the Indian Buffet Process (IBP) [Ghahramani and Griffiths, 2006], and its use in nonparametric Bayesian sparse factor analysis [Zhou et al., 2009], we impose a beta-Bernoulli prior on the assignment variables:
$$\pi_k \sim \mathrm{Beta}(c/K,\, c(1 - 1/K)), \qquad z_{kd} \sim \mathrm{Bernoulli}(\pi_k), \qquad (4)$$
which can be seen as an infinite spike-and-slab model as $K \to \infty$, where the spikes are provided by the beta-Bernoulli process and the slab is provided by the top-level gamma process. As a result, the proposed model assigns positive probability to only a subset of latent factors, selected independently of their masses. We further complete the hierarchical Bayesian model for multi-domain learning by placing appropriate priors on the model parameters in (1), (2), (3), and (4):
$$(\phi_{1k}, \ldots, \phi_{Vk}) \sim \mathrm{Dir}(\eta, \ldots, \eta), \quad \eta \sim \mathrm{Gamma}(s_0, w_0), \quad p^{(d)}_j \sim \mathrm{Beta}(a_0, b_0),$$
$$c^{(d)}_j \sim \mathrm{Gamma}(e_0, 1/f_0), \quad c_d \sim \mathrm{Gamma}(h_0, 1/u_0), \quad s_k \sim \mathrm{Gamma}(\gamma_0/K, 1/c_0),$$
$$\gamma_0 \sim \mathrm{Gamma}(a_0, 1/b_0), \quad c_0 \sim \mathrm{Gamma}(s_0, 1/t_0). \qquad (5)$$

From a biological perspective, the $K$ factors may correspond to the underlying biological processes, cellular components, or molecular functions causing cancer subtypes, or, more generally, different phenotypes or treatment responses in biomedicine. The corresponding sub-counts $n^{(d)}_{vjk}$ can be viewed as the contribution of underlying biological process $k$ to the expression of gene $v$ in sample $j$ of domain $d$. The probability parameter $p^{(d)}_j$, which depends on the sample index, accounts for the potential effect of varying sequencing depth of sample $j$ in domain $d$. More precisely, the expected expression of gene $v$ in sample $j$ and domain $d$ is $\sum_{k=1}^{K} \phi_{vk}\theta^{(d)}_{kj}\, p^{(d)}_j/(1 - p^{(d)}_j)$, and hence the term $\sum_{k=1}^{K} \phi_{vk}\theta^{(d)}_{kj}$ can be viewed as the true abundance of gene $v$ in domain $d$, after adjusting for the sequencing depth variation across samples. Specifically, it comprises contributions from both domain-dependent and globally shared latent factors, where the contribution of each latent factor can be learned automatically from the sample relevance across domains.

Given the BMDL model in Fig. 1, we derive an efficient Gibbs sampling algorithm with closed-form updating steps for inferring the model parameters by exploiting the data augmentation technique in Zhou and Carin [2015]. The detailed Gibbs sampling procedure is provided in the supplemental materials.

For real-world NGS datasets that are deeply sequenced and thus possess large counts, the steps in Gibbs sampling involving the Chinese Restaurant Table (CRT) distribution in Zhou and Carin [2015] are the main computational burden. To speed up sampling from the CRT, we propose the following scheme: to draw $\ell \sim \mathrm{CRT}(n, r)$ when $n$ is large, we first draw $\ell_1 \sim \mathrm{CRT}(m, r)$, where $m \ll n$. Then, we draw $\ell_2 \sim \mathrm{Pois}\left(r[\psi(n + r) - \psi(m + r)]\right)$, where $\psi(\cdot)$ is the digamma function. Finally, we take $\ell \approx \ell_1 + \ell_2$.
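A minimal sketch of this two-stage CRT sampler follows; the cutoff m = 1000 and the hand-rolled digamma (recurrence plus asymptotic series, to avoid a SciPy dependency) are our own illustrative choices:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def digamma(x):
    """Digamma psi(x) via the recurrence psi(x) = psi(x+1) - 1/x
    and the asymptotic series once x >= 6 (accurate for x > 0)."""
    s = 0.0
    while x < 6.0:
        s -= 1.0 / x
        x += 1.0
    return s + math.log(x) - 1.0 / (2.0 * x) - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x**4)

def crt_exact(n, r):
    """l ~ CRT(n, r): sum of n independent Bernoulli(r / (r + i)), i = 0..n-1."""
    probs = r / (r + np.arange(n))
    return int(rng.binomial(1, probs).sum())

def crt_fast(n, r, m=1000):
    """Two-stage scheme: exact CRT on m << n customers plus a Poisson correction
    with rate r * (psi(n + r) - psi(m + r)) for the remaining n - m."""
    if n <= m:
        return crt_exact(n, r)
    l1 = crt_exact(m, r)
    l2 = rng.poisson(r * (digamma(n + r) - digamma(m + r)))
    return l1 + l2

# Sanity check against the exact mean E[CRT(n, r)] = r * (psi(n + r) - psi(r))
n, r = 200_000, 2.0
mean_hat = np.mean([crt_fast(n, r) for _ in range(300)])
mean_true = r * (digamma(n + r) - digamma(r))
print(round(mean_hat, 1), round(mean_true, 1))  # both near 23.6
```

Only m Bernoulli draws are needed per CRT sample instead of n, which is the source of the speedup when n is on the order of 10^5.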
This approximation is inspired by Le Cam's theorem [Le Cam, 1960], and reduces the number of Bernoulli draws required for the CRT from $n$ to $m$, speeding up Gibbs sampling substantially in our experiments with TCGA NGS data, where $n > 10^5$ is not uncommon.

4 Experimental Results

To verify the advantages of our BMDL model, with its flexibility to capture the varying sample relevance across domains with both domain-specific and globally shared latent factors, we have designed experiments based on both simulated data and RNA-seq count data from TCGA [The Cancer Genome Atlas Research Network et al., 2008]. We have implemented BMDL to extract domain-dependent low-dimensional latent representations and then examined how well these extracted representations, used in an unsupervised manner, can subtype new testing samples. We have also compared the performance of BMDL to other Bayesian latent models for multi-domain learning, including

• NB-HDP [Zhou and Carin, 2012], for which all domains are assumed to share a set of latent factors. This is done through a simple Bayesian hierarchy where the base measure for the child DPs is itself distributed according to a DP. It assumes that the probability parameter of the NB is fixed at $p^{(d)}_j = 0.5$.

• HDP-NBFA: For a fair comparison, and to make sure that the superior performance of BMDL is not due only to its modeling of the sequencing depth variation across samples, we apply HDP to model the latent scores in NB factorization as well. More specifically, we model count data as $n^{(d)}_{jk} \sim \mathrm{NB}(\phi_k\theta^{(d)}_{kj}, p^{(d)}_j)$, where $\theta^{(d)}_{kj}$ follows a hierarchical DP instead of the hierarchical gamma process in our model.
Fixing $c^{(d)}_j = 1$ in (2) is considered here to construct an HDP whose group-level DPs are normalized from gamma processes with scale parameters $1/c^{(d)}_j = 1$.

• hGNBP [Zhou, 2018]: To evaluate the advantages of the beta-Bernoulli modeling in BMDL, we compare the results with the hGNBP, which models count data as $n^{(d)}_{jk} \sim \mathrm{NB}(\phi_k\theta^{(d)}_{kj}, p^{(d)}_j)$, where $\theta^{(d)}_{kj}$ follows a hierarchical gamma process and the parameter $z_{kd}$ in (4) is set to 1.

We illustrate that BMDL leads to more effective target-domain learning than both the HDP- and hGNBP-based models by assigning domain-specific latent factors to domains (using the beta-Bernoulli process) given the observed samples, while learning the latent representations globally in a similar fashion to HDP and hGNBP. In addition, we have also compared with hGNBP-NBFA [Zhou, 2018], which can be considered the baseline model, as it extracts latent representations using only the samples from the target domain. Compared to this baseline, we expect to show that BMDL effectively borrows signal strength across domains to improve classification accuracy in a target domain with very few samples.

Figure 2: The classification error of BMDL and hGNBP-NBFA as a function of (a) domain relevance, and (b) the number of target samples.

For all the experiments, we fix the truncation level $K = 100$, run 3,000 Gibbs sampling iterations, retain the weights $\{r^{(d)}_k\}_{1,K}$ and the posterior means of $\{\phi_k\}_{1,K}$ as factors, and use the last Markov chain Monte Carlo (MCMC) sample for the test procedure. With these $K$ inferred factors and weights, we further apply 1,000 blocked Gibbs sampling iterations and collect the last 500 MCMC samples to estimate the posterior mean of the latent factor score $\theta^{(d_t)}_j$ for every sample of the target domain $d_t$ in both the training and testing sets.
We then train a linear support vector machine (SVM) classifier [Schölkopf and Smola, 2002] on all $\bar{\theta}^{(d_t)}_j$ in the training set and use it to classify each $\bar{\theta}^{(d_t)}_j$ in the test set, where $\bar{\theta}^{(d_t)}_j \in \mathbb{R}^K$ is the estimated feature vector for sample $j$ in the target domain. For each binary classification task, we report the classification accuracy based on ten independent runs. Note that although we fix $K$ at a large enough value, we expect only a small subset of the $K$ latent factors to be used and all the others to be shrunk towards zero. More precisely, exploiting the inherent shrinkage property of the gamma process, we have imposed $\mathrm{Gamma}(\gamma_0/K, 1/c_0)$ as the prior on each factor strength parameter $s_k$, leading to a truncated approximation of the gamma process with $K$ atoms.

4.1 Synthetic data experiments

For the synthetic experiments, we compare BMDL against the baseline hGNBP-NBFA, which uses only target samples, to illustrate that multi-domain learning can improve prediction in the target domain.

For the first set of synthetic data experiments, we vary the sample relevance across domains. The degree of relevance is controlled by varying the number of latent factors shared by the domains. In this setup, we use two domains, 1,000 features, 50 latent factors per domain, 200 samples in the source domain, and 20 samples in the target domain, with 10 samples in each of the two classes. The number of shared latent factors between the two domains varies from 50 down to 0 to cover different degrees of domain relevance. The factor loading matrix of the first domain is generated from a Dirichlet distribution. To simulate the loading matrix for the second domain, we first select $N_{K_c}$ shared latent factors from the first domain, and then randomly generate $50 - N_{K_c}$ latent factors unique to the second domain, where $N_{K_c} \in \{0, 10, 20, \ldots, 50\}$.
The dispersion parameters of both domains are generated from a gamma process, $\mathrm{Gamma}(s_k, 1/c_d)$, where $s_k$ is generated from $\mathrm{Gamma}(\gamma_0/K, 1/c_0)$. The hyperparameters $\gamma_0$ and $c_0$ are drawn from $\mathrm{Gamma}(0.01, 0.01)$. To distinguish the two classes of generated samples in the target domain, we generate their factor scores with different scale parameters $c^{(d)}_j \sim \mathrm{Gamma}(a, 0.01)$, where $a$ is set to 100 and 150 in the first and second class, respectively.

From Figure 2(a), the first interesting observation is that BMDL automatically avoids "negative transfer": the classification errors of BMDL, which jointly learns the latent representations, are consistently lower than the classification errors obtained using only the target-domain data, no matter how many shared latent factors exist across the simulated domains. Furthermore, the classification error in the target domain decreases monotonically with the number of shared latent factors, which agrees with our intuition that BMDL achieves higher predictive power when the data across domains are more relevant. This suggests that the number of shared latent factors across domains may serve as a new measure of domain relevance.

In the second simulation study, we investigate how the number of target samples affects classification performance. In this setup, we simulate two related domains with 40 shared latent factors out of 50 total, vary the number of samples in the target domain from 10 to 40, and keep the other settings the same as in the first experiment.
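The synthetic data-generating scheme described above (shared vs. unique Dirichlet-drawn loadings, classes separated by the scale of the factor scores) can be sketched as follows; the Dirichlet concentration, the simplified score prior, and the direct use of the class scale are our own illustrative simplifications of the paper's gamma-process setup:

```python
import numpy as np

rng = np.random.default_rng(7)

V, K = 1000, 50            # features and latent factors per domain, as in the setup
N_Kc = 20                  # number of latent factors shared between the two domains

# Domain 1 loadings: each factor drawn from a Dirichlet over the V features
phi1 = rng.dirichlet(np.full(V, 0.05), size=K).T               # shape (V, 50)

# Domain 2: reuse N_Kc factors from domain 1, generate 50 - N_Kc unique ones
shared_idx = rng.choice(K, size=N_Kc, replace=False)
phi2 = np.hstack([phi1[:, shared_idx],
                  rng.dirichlet(np.full(V, 0.05), size=K - N_Kc).T])

def simulate_class(phi, n_samples, a):
    """Counts for one class; only the scale a of the per-sample gamma
    draw differs between classes (a = 100 vs. 150 in the paper's setup)."""
    X = np.zeros((phi.shape[0], n_samples), dtype=np.int64)
    for j in range(n_samples):
        c_j = rng.gamma(a, 0.01)                        # c_j^(d) ~ Gamma(a, 0.01)
        theta = rng.gamma(1.0, c_j, size=phi.shape[1])  # simplified score prior
        X[:, j] = rng.poisson(phi @ theta)              # Poisson-gamma mixture
    return X

# Target domain: 10 samples per class
X_class1 = simulate_class(phi2, 10, a=100)
X_class2 = simulate_class(phi2, 10, a=150)
print(X_class1.shape, X_class2.shape)  # (1000, 10) (1000, 10)
```

Sweeping N_Kc from 0 to 50 reproduces the relevance axis of Figure 2(a): the two domains range from entirely unrelated to identical loading structure.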
Figure 2(b) shows that increasing the number of target samples improves the performance of both the baseline hGNBP-NBFA, which uses only target data, and BMDL, which integrates data across domains, as expected. More interestingly, the improvement of BMDL over hGNBP-NBFA decreases with the number of target samples, which agrees with the general trend in the TL/MTL literature [Pardoe and Stone, 2010, Karbalayghareh et al., 2018] that the prediction performance eventually converges to the optimal Bayes error when there are enough samples in the target domain.

4.2 Case study: Lung cancer

We consider two setups for analyzing RNA-seq count data from studies on two subtypes of lung cancer, i.e., Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC), from TCGA [The Cancer Genome Atlas Research Network et al., 2008]. First, we take two types of NGS data, RNA-seqV2 and RNA-seq, from the same lung cancer study as two highly-related domains, since the source and target domain difference is due simply to profiling techniques. Second, we use RNA-seq data from a Head and Neck Squamous Cell Carcinoma (HNSC) cancer study as the source domain and the above RNA-seq lung cancer data as the target domain. These are considered low-related domains, as the two cancer types have quite different disease mechanisms. In this set of experiments, we take 10 samples of each subtype of lung cancer in the target domain to test cancer subtyping performance. We also investigate the effect of the number of source samples, $N_s$, on cancer subtyping in the target domain by setting $N_s = 25$ and $100$.

For all the TCGA NGS datasets, we first selected the genes that appear in all the datasets and then filtered out the genes whose total read counts across samples are less than 50, resulting in roughly 14,000 genes in each dataset.
We first divided the lung cancer datasets into training and test sets, and then performed differential gene expression analysis on the training set using DESeq2 [Love et al., 2014], by which 1,000 of the top 5,000 genes with the highest log2 fold change between LUAD and LUSC were selected for subsequent analyses. We first check the subtyping accuracy obtained by directly applying a linear SVM to the raw counts in the target domain, which gives an average accuracy of 59.28% with a sample standard deviation (STD) of 5.54% over ten independent runs. We also transform the count data to standard normal data after removing the sequencing depth effect using DESeq2 [Love et al., 2014] and then apply regularized logistic regression provided by the LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) package [Fan et al., 2008]. The classification accuracy becomes 74.10% ± 4.41%.

Table 1 compares the cancer subtyping performance of BMDL, NB-HDP, HDP-NBFA, and hGNBP, as well as the baseline hGNBP-NBFA using only the samples from the target domain. In fact, when analyzing data across the highly-related lung cancer domains, 98 of the 100 latent factors identified in the target domain by BMDL are shared between the two RNA-seq techniques, while for the low-related domains of lung cancer and HNSC, only 25 of the 62 latent factors extracted for lung cancer by BMDL are shared with HNSC. This is consistent with our biological knowledge of the sample relevance in the two setups. As the table shows, BMDL consistently achieves better cancer subtyping in both the highly- and low-related setups. On the contrary, not only do the HDP-based methods fail to improve the results in the low-related setup, their performance degrades further, with more severe "negative transfer" effects, when more source samples are used.
The reason for this is that HDP assumes a latent factor with higher weight in the shared DP will occur more frequently within each sample [Williamson et al., 2010].

Table 1: Lung cancer subtyping results (average accuracy (%) and STD)

Method       | highly-related (Ns=25) | highly-related (Ns=100) | low-related (Ns=25) | low-related (Ns=100)
NB-HDP       | 55.22 ± 3.69           | 56.52 ± 4.61            | 54.57 ± 7.73        | 53.83 ± 7.79
HDP-NBFA     | 63.48 ± 1.23           | 65.65 ± 4.22            | 54.89 ± 7.38        | 51.83 ± 8.32
hGNBP        | 74.13 ± 7.07           | 77.61 ± 3.54            | 72.94 ± 1.70        | 74.55 ± 8.84
BMDL         | 78.46 ± 5.97           | 81.49 ± 5.12            | 78.85 ± 4.55        | 78.10 ± 5.65
hGNBP-NBFA   | 73.38 ± 7.29 (target domain only)

This might be an undesirable assumption, especially when the domains are distantly related. For example, a latent factor might not be present throughout the HNSC samples but be dominant within the lung cancer samples. HDP based methods are not able to discover such latent factors due to the limited number of lung cancer samples. In addition to this undesirable assumption, NB-HDP does not account for the sequencing-depth heterogeneity of different samples, which may lead to biased results that deteriorate subtyping performance, as shown in Table 1.
HDP-NBFA exploits the advantages of modeling the NB dispersion and improves over NB-HDP due to the flexibility of learning p_j^(d), especially in the highly-related setup. This demonstrates the benefits of inferring the sequencing depth in RNA-seq count applications. Although in the highly-related setup the HDP-NBFA performance improves with an increasing number of source samples, we still observe the same "negative transfer" effect in the low-related setup.
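Why a per-sample sequencing-depth factor matters can be sketched with a toy gamma-Poisson (negative binomial) factorization simulation; this is only an illustration of overdispersed count generation, not the actual BMDL model, and all dimensions, priors, and parameter values below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, N = 200, 5, 50                # genes, latent factors, samples (illustrative)

phi = rng.dirichlet(np.ones(V) * 0.1, size=K).T   # factor loadings, V x K, columns sum to 1
theta = rng.gamma(1.0, 10.0, size=(K, N))         # factor scores, K x N
depth = rng.uniform(0.5, 2.0, size=N)             # per-sample sequencing depth

# Mean counts scale with sequencing depth; ignoring depth would
# confound library size with biological signal.
rate = (phi @ theta) * depth

# NB counts via the gamma-Poisson mixture: lambda ~ Gamma(r, rate/r),
# counts ~ Poisson(lambda), giving variance rate + rate^2/r > rate.
r = 2.0                                           # NB dispersion (illustrative)
counts = rng.poisson(rng.gamma(r, rate / r))
```

Empirically, the variance of the simulated counts exceeds their mean, the overdispersion that motivates NB modeling over a plain Poisson.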
Again, integrating more source samples is beneficial when the samples across domains are highly relevant, but it can be detrimental when the relevance assumption does not hold, as both NB-HDP and HDP-NBFA force a similar structure of latent factors across domains.
The better performance of the gamma process based models compared to HDP based models, in both the low and high domain relevance scenarios, may be explained by the negative correlation structure that the Dirichlet process imposes on the weights of latent factors; the gamma process models these weights independently, and hence allows more flexibility for adjusting latent representations across domains. On the other hand, when comparing the performance of BMDL and hGNBP, domain-specific latent factor assignment using the beta-Bernoulli process can be considered the main reason for the superior performance of BMDL, especially in the low-related setup.
Compared to the baseline hGNBP-NBFA, BMDL clearly improves cancer subtyping performance. Even using a small portion of the related source domain samples, the subtyping accuracy can be improved by around 5%. With more highly-related source samples, the improvement can be up to 8%. Compared to the HDP based methods, BMDL achieves up to 16% improvement in the highly-related setup due to the benefits of the gamma process modeling of count data instead of the DP used in HDP models, which forces negative correlation and restricts the distribution over latent factor abundance [Williamson et al., 2010]. Compared to hGNBP, BMDL achieves up to 4% and 6% accuracy improvement in the highly- and low-related setups, respectively, due to domain-specific latent factor assignment using the beta-Bernoulli process.
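The contrast between DP and gamma process weights can be checked numerically: components of a (finite-dimensional) Dirichlet vector are negatively correlated because of the sum-to-one constraint, while independently drawn gamma weights are uncorrelated. A minimal sketch, with illustrative dimensions unrelated to the BMDL experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 20000, 5

# Dirichlet weights: the sum-to-one constraint induces negative correlation;
# for symmetric Dirichlet(1) the theoretical correlation is -1/(K-1) = -0.25.
dir_w = rng.dirichlet(np.ones(K), size=n)
corr_dir = np.corrcoef(dir_w[:, 0], dir_w[:, 1])[0, 1]

# Unnormalized gamma weights (as in gamma process based models):
# drawn independently, so their empirical correlation is near zero.
gam_w = rng.gamma(1.0, 1.0, size=(n, K))
corr_gam = np.corrcoef(gam_w[:, 0], gam_w[:, 1])[0, 1]
```

Here `corr_dir` comes out clearly negative while `corr_gam` hovers around zero, matching the flexibility argument above.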
Since the selector variables z_kd in BMDL help to assign only a finite number of latent factors to each domain, it is sufficient merely to ensure that the sum of any finite subset of top-level atoms is finite. This eliminates the restrictions on factor score parameters imposed by the DP, and improves subtyping accuracy since the latent factor abundances are independent.
BMDL also does not impose any restriction on the number of domains and can be applied to more than two domains. To show this, we have also conducted another case study with three domains using both the highly- and low-related TCGA datasets. The accuracy of BMDL is 79.71% ± 5.32% and 81.96% ± 4.96% when using N_s^(ds1) = N_s^(ds2) = 25 and 100 samples for the two source domains as described earlier, respectively. Compared to one source and one target domain with 25 source samples, the accuracy of using three domains has improved by 1%. Having two source domains with more samples (N_s^(ds1) + N_s^(ds2) = 50) leads to a more robust estimation of φ_vk and improves the subtyping accuracy. When there are enough samples (N_s^(ds1) = 100) in the highly-related domain, adding another low-related domain does not improve the subtyping results. It is notable, however, that the accuracy increased by around 4% when adding the highly-related domain with 100 samples to 100 low-related samples. The results show that 1) using more domains with more samples does help subtyping in the target domain; and 2) BMDL avoids negative transfer even when adding samples from low-related domains.
We would like to emphasize again that, unlike existing methods, BMDL infers the domain relevance from the given data and derives domain-adaptive latent factors to improve the predictive power in the target domain, regardless of the degree of domain relevance.
This is important in real-world setups where the samples across domains are distantly related or the sample relevance is uncertain. As the results have demonstrated, BMDL achieves similar performance improvements in the low-related setup as in the highly-related setup, without the "negative transfer" symptoms often witnessed in existing TL/MTL methods. This shows the great potential for effective data integration and joint learning even in the low-related setup: the performance is better than that of the competing methods as well as the baseline hGNBP-NBFA using only target samples, and increasing the number of source samples does not hurt the performance.

5 Conclusions

We have developed a multi-domain NB latent factorization model tailored for Bayesian multi-domain learning of NGS count data, BMDL. By introducing this hierarchical Bayesian model with selector variables to flexibly assign both domain-specific and globally shared latent factors to different domains, the derived latent representations of NGS data preserve the predictive information in the corresponding domains, so that accurate cancer subtyping is possible even with a limited number of samples. As BMDL learns domain relevance based on the given samples across domains and enables the flexibility of sharing useful information through common latent factors (if any), BMDL performs consistently better than single-domain learning regardless of the domain relevance level. Our experiments have shown the promising potential of BMDL for accurate and reproducible cancer subtyping with "small" data through effective multi-domain learning that takes advantage of available data from different sources.
Acknowledgements We would like to thank Dr. Sahar Yarian for insightful discussions. We also thank Texas A&M High Performance Research Computing and the Texas Advanced Computing Center for providing the computational resources used to perform the experiments in this work.
This work was supported in part by the NSF Awards CCF-1553281, IIS-1812641, and IIS-1812699.

References

A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41-48, 2007.

C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech & Language, 20(4):382-399, 2006.

S. Z. Dadaneh, X. Qian, and M. Zhou. BNP-Seq: Bayesian nonparametric differential expression analysis of sequencing count data. Journal of the American Statistical Association, 113(521):81-94, 2018. doi: 10.1080/01621459.2017.1328358.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871-1874, 2008.

Z. Ghahramani and T. L. Griffiths. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475-482, 2006.

E. Hajiramezanali, S. Z. Dadaneh, P. de Figueiredo, S.-H. Sze, M. Zhou, and X. Qian. Differential expression analysis of dynamical sequencing count data with a gamma Markov chain. arXiv preprint arXiv:1803.02527, 2018.

L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745-752, 2009.

Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In ICML, pages 521-528, 2011.

A. Karbalayghareh, X. Qian, and E. R. Dougherty. Optimal Bayesian transfer learning. IEEE Transactions on Signal Processing, 2018.

A. Kumar and H. Daume III. Learning task grouping and overlap in multi-task learning. arXiv preprint arXiv:1206.6417, 2012.

L. Le Cam. An approximation theorem for the Poisson binomial distribution.
Pacific Journal of Mathematics, 10(4):1181-1197, 1960.

M. I. Love, W. Huber, and S. Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, 2014.

S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.

D. Pardoe and P. Stone. Boosting for regression transfer. In Proceedings of the 27th International Conference on Machine Learning, pages 863-870. Omnipress, 2010.

A. Passos, P. Rai, J. Wainer, and H. Daume III. Flexible modeling of latent task structures in multitask learning. arXiv preprint arXiv:1206.6486, 2012.

V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53-69, 2015.

P. Rai and H. Daume III. Infinite predictor subspace models for multitask learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 613-620, 2010.

B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 1385-1392, 2005.

The Cancer Genome Atlas Research Network et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061, 2008.

R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, pages 564-571, 2007.

S. Williamson, C. Wang, K. Heller, and D. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.

Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8(Jan):35-63, 2007.

M. Zhou. Nonparametric Bayesian negative binomial factor analysis. Bayesian Analysis, pages 1061-1089, 2018.

M. Zhou and L. Carin. Augment-and-conquer negative binomial processes. In Advances in Neural Information Processing Systems, pages 2546-2554, 2012.

M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):307-320, 2015.

M. Zhou, H. Chen, L. Ren, G. Sapiro, L. Carin, and J. W. Paisley. Non-parametric Bayesian dictionary learning for sparse image representations. In Advances in Neural Information Processing Systems, pages 2295-2303, 2009.