{"title": "General Table Completion using a Bayesian Nonparametric Model", "book": "Advances in Neural Information Processing Systems", "page_first": 981, "page_last": 989, "abstract": "Even though heterogeneous databases can be found in a broad variety of applications, there exists a lack of tools for estimating missing data in such databases. In this paper, we provide an efficient and robust table completion tool, based on a Bayesian nonparametric latent feature model. In particular, we propose a general observation model for the Indian buffet process (IBP) adapted to mixed continuous (real-valued and positive real-valued) and discrete (categorical, ordinal and count) observations. Then, we propose an inference algorithm that scales linearly with the number of observations. Finally, our experiments over five real databases show that the proposed approach provides more robust and accurate estimates than the standard IBP and the Bayesian probabilistic matrix factorization with Gaussian observations.", "full_text": "General Table Completion using a Bayesian\n\nNonparametric Model\n\nIsabel Valera\n\nDepartment of Signal Processing\n\nand Communications\n\nUniversity Carlos III in Madrid\nivalera@tsc.uc3m.es\n\nZoubin Ghahramani\n\nDepartment of Engineering\nUniversity of Cambridge\n\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nEven though heterogeneous databases can be found in a broad variety of applica-\ntions, there exists a lack of tools for estimating missing data in such databases. In\nthis paper, we provide an ef\ufb01cient and robust table completion tool, based on a\nBayesian nonparametric latent feature model. In particular, we propose a general\nobservation model for the Indian buffet process (IBP) adapted to mixed continuous\n(real-valued and positive real-valued) and discrete (categorical, ordinal and count)\nobservations. Then, we propose an inference algorithm that scales linearly with\nthe number of observations. 
Finally, our experiments over \ufb01ve real databases show\nthat the proposed approach provides more robust and accurate estimates than the\nstandard IBP and the Bayesian probabilistic matrix factorization with Gaussian\nobservations.\n\n1\n\nIntroduction\n\nA full 90% of all the data in the world has been generated over the last two years and this expansion\nrate will not diminish in the years to come [17]. This extreme availability of data explains the great\ninvestment that both the industry and the research community are expending in data science. Data is\nusually organized and stored in databases, which are often large, noisy, and contain missing values.\nMissing data may occur in diverse applications due to different reasons. For example, a sensor in\na remote sensor network may be damaged and transmit corrupted data or even cease to transmit;\nparticipants in a clinical study may drop out during the course of the study; or users of a recom-\nmendation system rate only a small fraction of the available books, movies, or songs. The presence\nof missing values can be challenging when the data is used for reporting, information sharing and\ndecision support, and as a consequence, missing data treatment has captured the attention in diverse\nareas of data science such as machine learning, data mining, and data warehousing and management.\nSeveral studies have shown that probabilistic modeling can help to estimate missing values, detect\nerrors in databases, or provide probabilistic responses to queries [19]. In this paper, we exclusively\nfocus on the use of probabilistic modeling for missing data estimation, and assume that the data\nare missing completely at random (MCAR). There is extensive literature in probabilistic missing\ndata estimation and imputation in homogeneous databases, where all the attributes that describe\neach object in the database present the same (continuous or discrete) nature. 
Most of the work assumes that databases contain only continuous data, usually modeled as Gaussian variables [21], or only discrete data, which can either be modeled by discrete likelihoods [9] or simply be treated as Gaussian variables [15, 21]. However, there is still a lack of work dealing with heterogeneous databases, which are in fact common in real applications and where the standard approach is to treat all the attributes, whether continuous or discrete, as Gaussian variables. As a motivating example, consider a database that contains the answers to a survey, including diverse information about the participants such as age (count data), gender (categorical data), salary (continuous non-negative data), etc.

In this paper, we provide a general Bayesian approach for estimating and replacing the missing data in heterogeneous databases (the data being MCAR), where the attributes describing each object can be discrete, continuous or mixed variables. Specifically, we account for real-valued, positive real-valued, categorical, ordinal and count data. To this end, we assume that the information in the database can be stored in a matrix (or table), where each row corresponds to an object and the columns are the attributes that describe the different objects. We propose a novel Bayesian nonparametric approach for general table completion based on latent feature modeling, in which each object is represented by a set of latent variables and the observations are generated from a distribution determined by those latent features. Since the number of latent variables needed to explain the data depends on the specific database, we use the Indian buffet process (IBP) [8], which places a prior distribution over binary matrices in which the number of columns (latent variables) is unbounded. The standard IBP assumes real-valued observations combined with conjugate likelihood models that allow for fast inference algorithms [4]. 
Here, we aim at dealing with heterogeneous databases, which may contain mixed continuous and discrete observations. We propose a general observation model for the IBP that accounts for mixed continuous and discrete data, while keeping the properties of conjugate models. This allows us to derive an inference algorithm that scales linearly with the number of observations. The proposed algorithm not only infers the latent variables for each object in the table, but also provides accurate estimates for its missing values. Our experiments over five real databases show that our approach for table completion outperforms, in terms of accuracy, the Bayesian probabilistic matrix factorization (BPMF) [15] and the standard IBP, both of which assume Gaussian observations. We also observe that the approach based on treating mixed continuous and discrete data as Gaussian fails to estimate some attributes, while the proposed approach provides robust estimates for all the missing values regardless of their discrete or continuous nature.

The main contributions of this paper are: i) a general observation model (for mixed continuous and discrete data) for the IBP that allows us to derive an inference algorithm that scales linearly with the number of objects, and its application to build ii) a general and scalable tool to estimate missing values in heterogeneous databases. An efficient C implementation for Matlab of the proposed table completion tool is also released on the authors' website.

2 Related Work

In recent years, probabilistic modeling has become an attractive option for building database management systems, since it allows estimating missing values, detecting errors, visualizing the data, and providing probabilistic answers to queries [19]. BayesDB,¹ for instance, is a database management system that resorts to CrossCat [18], which originally appeared as a Bayesian approach to model human categorization of objects. 
BayesDB provides missing data estimates and probabilistic answers to queries, but it only considers Gaussian and multinomial likelihood functions.

In the literature, probabilistic low-rank matrix factorization approaches have been broadly applied to table completion (see, e.g., [14, 15, 21]). In these approaches, the data table X is approximated by a low-rank matrix representation X ≈ ZB, where Z and B are usually assumed to be Gaussian distributed. Most of the work in this area has focused on building automatic recommendation systems, which appears to be the most popular application of missing data estimation [14, 15, 21]. More specific models for building recommendation systems can be found in [7, 22], where the authors assume that the ratings each user assigns to items are generated by a probabilistic generative model which, based on the available data, accounts for similarities among users and among items to provide good estimates of the missing ratings.

Probabilistic matrix factorization can also be viewed as latent feature modeling, where each object is represented by a vector of continuous latent variables. In contrast, the IBP and other latent feature models (see, e.g., [16]) assume binary latent features to represent each object. Latent feature models usually assume homogeneous databases with either real [14, 15, 21] or categorical data [9, 12, 13], and only a few works consider heterogeneous data, such as mixed real and categorical data [16]. However, to the best of our knowledge, there are no general latent feature models (nor table completion tools) that directly deal with heterogeneous databases. To fill this gap, in this paper we provide a general table completion approach for heterogeneous databases, based on a generalized IBP, that allows for efficient inference.

¹ http://probcomp.csail.mit.edu/bayesdb/

3 Model Description

Let us assume a table with N objects, where each object is defined by D attributes. 
We can store the data in an N × D observation matrix X, in which each D-dimensional row vector is denoted by x_n = [x_n^1, ..., x_n^D] and each entry is denoted by x_n^d. We consider that the column vectors x^d (i.e., each dimension in the observation matrix X) may contain the following types of data:

• Continuous variables:
  1. Real-valued data, i.e., x_n^d ∈ ℝ.
  2. Positive real-valued data, i.e., x_n^d ∈ ℝ+.

• Discrete variables:
  1. Categorical data, i.e., x_n^d takes values in a finite unordered set, e.g., x_n^d ∈ {'blue', 'red', 'black'}.
  2. Ordinal data, i.e., x_n^d takes values in a finite ordered set, e.g., x_n^d ∈ {'never', 'sometimes', 'often', 'usually', 'always'}.
  3. Count data, i.e., x_n^d ∈ {0, ..., ∞}.

We assume that each observation x_n^d can be explained by a K-length vector of latent variables associated to the n-th data point, z_n = [z_n1, ..., z_nK], and a weighting vector² B^d = [b_1^d, ..., b_K^d] (K being the number of latent variables), whose elements b_k^d weight the contribution of the k-th latent feature to the d-th dimension of X. We gather the latent binary feature vectors z_n in an N × K matrix Z, which follows an IBP with concentration parameter α, i.e., Z ∼ IBP(α) [8]. We place a Gaussian distribution with zero mean and covariance matrix σ_B^2 I_K over the weighting vectors B^d. For convenience, z_n is a K-length row vector, while B^d is a K-length column vector.

To accommodate all the kinds of observed random variables described above, we introduce an auxiliary Gaussian variable y_n^d such that, conditioned on the auxiliary variables, the latent variable model behaves as a standard IBP with Gaussian observations. In particular, we assume that y_n^d is Gaussian distributed with mean z_n B^d and variance σ_y^2, i.e.,

    p(y_n^d | z_n, B^d) = N(y_n^d | z_n B^d, σ_y^2),

and assume that there exists a transformation function over the variables y_n^d that yields the observations x_n^d, mapping the real line ℝ into the observation space. The resulting generative model is shown in Figure 1, where Z is the IBP latent matrix, and Y^d and B^d contain, respectively, the auxiliary Gaussian variables y_n^d and the weighting factors b_k^d for the d-th dimension of the data. Additionally, Ψ^d denotes the set of auxiliary random variables needed to obtain the observation vector x^d given Y^d, and H^d contains the hyper-parameters associated to the random variables in Ψ^d. This model assumes that the observations x_n^d are independent given the latent matrix Z, the weighting matrices B^d and the auxiliary variables Ψ^d. Therefore, the likelihood factorizes as

    p(X | Z, {B^d, Ψ^d}_{d=1}^D) = ∏_{d=1}^D p(x^d | Z, B^d, Ψ^d) = ∏_{d=1}^D ∏_{n=1}^N p(x_n^d | z_n, B^d, Ψ^d).

Note that, if we assume Gaussian observations and set Y^d = x^d, this model resembles the standard IBP with Gaussian observations [8]. In addition, conditioned on the variables Y^d, we can infer the latent matrix Z as in the standard IBP. We also remark that auxiliary Gaussian variables linking a latent model with the observations have been previously used in Gaussian processes for multi-class classification [6] and for ordinal regression [2]. 
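As a concrete illustration of the prior Z ∼ IBP(α) used above, the IBP can be simulated with its sequential "restaurant" construction: the n-th object takes an existing feature k with probability m_k / n (m_k being the number of previous objects with that feature) and then activates a Poisson(α/n) number of brand-new features. The following is a minimal sketch of that construction; the function and variable names are ours, not those of the released implementation:

```python
import numpy as np

def sample_ibp(n_objects, alpha, rng=None):
    """Sample a binary latent feature matrix Z ~ IBP(alpha) via the
    sequential restaurant construction: object n takes each existing
    feature with probability m_k / n, then adds Poisson(alpha / n)
    new features of its own."""
    rng = np.random.default_rng(rng)
    columns = []                                  # one list of row indices per feature
    for n in range(1, n_objects + 1):
        for col in columns:                       # existing features
            if rng.random() < len(col) / n:
                col.append(n - 1)
        for _ in range(rng.poisson(alpha / n)):   # brand-new features
            columns.append([n - 1])
    Z = np.zeros((n_objects, len(columns)), dtype=int)
    for k, col in enumerate(columns):
        for row in col:
            Z[row, k] = 1
    return Z
```

The expected number of active columns after N objects is α times the N-th harmonic number, so the number of latent features grows slowly and without a fixed upper bound, matching the unbounded-column prior described above.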
However, to the best of our knowledge, this simple approach has not been used to account for mixed continuous and discrete data, and the existing approaches for the IBP with discrete observations propose non-conjugate likelihood models and approximate inference algorithms [12, 13].

3.1 Likelihood Functions

Now, we define the set of transformations that map from the Gaussian variables y_n^d to the corresponding observations x_n^d. We consider that each dimension in the table X may contain any of the discrete or continuous variables detailed above, and we provide a likelihood function for each kind of data and, in turn, also a likelihood function for mixed data.

² For convenience, we capitalized here the notation for the weighting vectors B^d.

Real-valued Data. In this case, we assume that x^d = Y^d in the model in Figure 1 and consider the standard approach when dealing with real-valued observations, which consists of assuming a Gaussian likelihood function. In particular, as in the standard linear-Gaussian IBP [8], we assume that each observation x_n^d is distributed as

    p(x_n^d | z_n, B^d) = N(x_n^d | z_n B^d, σ_y^2).

Positive Real-valued Data. In order to obtain positive real-valued observations, i.e., x_n^d ∈ ℝ+, we apply a transformation over y_n^d that maps from the real numbers to the positive real numbers, i.e.,

    x_n^d = f(y_n^d + u_n^d),

where u_n^d is a Gaussian noise variable with variance σ_u^2, and f : ℝ → ℝ+ is a monotonic differentiable function. By a change of variables, we obtain the likelihood function for positive real-valued observations as

    p(x_n^d | z_n, B^d) = (1 / sqrt(2π(σ_y^2 + σ_u^2))) exp{ −(f⁻¹(x_n^d) − z_n B^d)² / (2(σ_y^2 + σ_u^2)) } |d f⁻¹(x_n^d) / d x_n^d|,    (1)

where f⁻¹ : ℝ+ → ℝ is the inverse function of the transformation f(·), i.e., f⁻¹(f(v)) = v. Note that in this case we resort to the Gaussian variable u_n^d in order to obtain x_n^d from y_n^d, and therefore, Ψ^d = u^d and H^d = σ_u^2.

Categorical Data. Now we account for categorical observations, i.e., each observation x_n^d can take values in the unordered index set {1, ..., R_d}. Hence, assuming a multinomial probit model, we can write

    x_n^d = arg max_{r ∈ {1, ..., R_d}} y_nr^d,    (2)

where y_nr^d ∼ N(y_nr^d | z_n b_r^d, σ_y^2), and b_r^d denotes the K-length weighting vector in which each b_kr^d weights the influence of the k-th feature on the observation x_n^d taking value r. Note that, under this likelihood model, since we have a Gaussian auxiliary variable y_nr^d and a weighting factor b_kr^d for each possible value of the observation r ∈ {1, ..., R_d}, we need to gather all the weighting factors b_kr^d in a K × R_d matrix B^d, and all the Gaussian auxiliary variables y_nr^d in an N × R_d matrix Y^d. Under this observation model, we can write y_nr^d = z_n b_r^d + u_nr^d, where u_nr^d is a Gaussian noise variable with variance σ_y^2, and therefore, we can obtain the probability of each element x_n^d taking value r ∈ {1, ..., R_d} as [6]

    p(x_n^d = r | z_n, B^d) = E_{p(u)} [ ∏_{j=1, j≠r}^{R_d} Φ( u + z_n (b_r^d − b_j^d) ) ],    (3)

where the subscript r in b_r^d stands for the column in B^d (r ∈ {1, ..., R_d}), Φ(·) denotes the cumulative density function of the standard normal distribution, and E_{p(u)}[·] denotes the expectation with respect to the distribution p(u) = N(0, σ_y^2).

Ordinal Data. Consider ordinal data, in which each element x_n^d takes values in the ordered index set {1, ..., R_d}. Then, assuming an ordered probit model, we can write

    x_n^d = 1      if y_n^d ≤ θ_1^d,
            2      if θ_1^d < y_n^d ≤ θ_2^d,
            ...
            R_d    if θ_{R_d − 1}^d < y_n^d,    (4)

where again y_n^d is Gaussian distributed with mean z_n B^d and variance σ_y^2, and θ_r^d for r ∈ {1, ..., R_d − 1} are the thresholds that divide the real line into R_d regions. We assume the thresholds θ_r^d are sequentially generated from the truncated Gaussian distribution θ_r^d ∝ N(θ_r^d | 0, σ_θ^2) I(θ_r^d > θ_{r−1}^d), where θ_0^d = −∞ and θ_{R_d}^d = +∞. As opposed to the categorical case, now we have a unique weighting vector B^d and a unique Gaussian variable y_n^d for each observation x_n^d. Hence, the value of x_n^d is determined by the region in which y_n^d falls. Under the ordered probit model [2], the probability of each element x_n^d taking value r ∈ {1, ..., R_d} can be written as

    p(x_n^d = r | z_n, B^d) = Φ( (θ_r^d − z_n B^d) / σ_y ) − Φ( (θ_{r−1}^d − z_n B^d) / σ_y ).    (5)

Let us remark that, if the d-th dimension of the observation matrix contains ordinal data, the set of auxiliary variables reduces to the Gaussian thresholds Ψ^d = {θ_1^d, ..., θ_{R_d − 1}^d} and H^d = σ_θ^2.

Count Data. In count data, each observation x_n^d takes non-negative integer values, i.e., x_n^d ∈ {0, ..., ∞}. Then, we assume

    x_n^d = ⌊f(y_n^d)⌋,    (6)

where ⌊v⌋ returns the floor of v, that is, the largest integer that does not exceed v, and f : ℝ → ℝ+ is a monotonic differentiable function that maps from the real numbers to the positive real numbers. We can therefore write the likelihood function as

    p(x_n^d | z_n, B^d) = Φ( (f⁻¹(x_n^d + 1) − z_n B^d) / σ_y ) − Φ( (f⁻¹(x_n^d) − z_n B^d) / σ_y ),    (7)

where f⁻¹ : ℝ+ → ℝ is the inverse function of the transformation f(·).

Figure 1: Generalized IBP for mixed continuous and discrete observations.

4 Inference Algorithm

In this section we describe our algorithm for inferring the latent variables given the observation matrix. Under our model, detailed in Section 3, the probability distribution over the observation matrix is fully characterized by the latent matrices Z and {B^d}_{d=1}^D (as well as the auxiliary variables Ψ^d). Hence, if we assume the latent vector z_n for the n-th datapoint and the weighting factors B^d (and the auxiliary variables Ψ^d) to be known, we have a probability distribution over each missing observation x_n^d from which we can obtain estimates for x_n^d by sampling from this distribution,³ or by simply taking its mean, mode or median value. 

³ Note that sampling from this distribution might be computationally expensive. In this case, we can easily obtain samples of x_n^d by exploiting the structure of our model. In particular, we can simply sample the auxiliary Gaussian variables y_n^d given z_n and B^d, and then obtain an estimate for x_n^d by applying the corresponding transformation, detailed in Section 3.1.
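The shortcut of footnote 3 (draw the auxiliary Gaussian y_n^d given z_n and B^d, then push it through the transformation of Section 3.1 for the attribute type) can be sketched as below. This is an illustrative sketch under our own naming, not the released implementation; it assumes the softplus choice f(x) = log(exp(wx) + 1) that Section 5 adopts for positive real-valued and count data, and it takes the ordinal thresholds as an explicit argument:

```python
import numpy as np

def impute(zn, Bd, kind, sigma_y=1.0, w=1.0, thresholds=None, rng=None):
    """Draw one sample of a missing entry x_n^d: sample the auxiliary
    Gaussian y ~ N(z_n B^d, sigma_y^2), then apply the per-type map
    ('real', 'positive', 'count', 'ordinal' or 'categorical')."""
    rng = np.random.default_rng(rng)
    f = lambda y: np.log1p(np.exp(w * y))          # softplus: R -> R+
    if kind == 'categorical':
        # one auxiliary Gaussian per category; take the arg max (Eq. (2))
        y = zn @ Bd + sigma_y * rng.standard_normal(Bd.shape[1])
        return int(np.argmax(y)) + 1
    y = float(zn @ Bd) + sigma_y * rng.standard_normal()
    if kind == 'real':
        return y
    if kind == 'positive':
        return f(y)
    if kind == 'count':
        return int(np.floor(f(y)))                 # Eq. (6)
    if kind == 'ordinal':
        # index of the region of the real line in which y falls (Eq. (4))
        return int(np.searchsorted(thresholds, y)) + 1
    raise ValueError(kind)
```

For a categorical attribute, Bd is the K × R_d matrix of the multinomial probit model; for the other types it is the K-length weighting vector.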
However, this procedure requires the latent matrix Z and the latent weighting factors B^d (and Ψ^d) to be known. We use Markov chain Monte Carlo (MCMC) methods, which have been broadly applied to infer the IBP matrix (see, e.g., [8, 23, 20]). The proposed inference algorithm is summarized in Algorithm 1. This algorithm exploits the information in the available data to learn the similarities among the objects (captured in our model by the latent feature matrix Z), and how these latent features show up in the attributes that describe the objects (captured in our model by B^d).

In Algorithm 1, we first need to update the latent matrix Z. Note that, conditioned on {Y^d}_{d=1}^D, both the latent matrix Z and the weighting matrices {B^d}_{d=1}^D are independent of the observation matrix X. Additionally, since {B^d}_{d=1}^D and {Y^d}_{d=1}^D are Gaussian distributed, we can analytically marginalize out the weighting matrices {B^d}_{d=1}^D to obtain p({Y^d}_{d=1}^D | Z). Therefore, to infer the matrix Z, we can apply the collapsed Gibbs sampler, which presents better mixing properties than the uncollapsed Gibbs sampler and, in consequence, is the standard method of choice in the context of the standard linear-Gaussian IBP [8]. However, this algorithm suffers from a high computational cost (its complexity per iteration being cubic in the number of data points N), which is prohibitive when dealing with large databases. In order to overcome this limitation, we resort to the accelerated Gibbs sampler [4] instead. This algorithm presents linear complexity in the number of datapoints and is detailed in the Supplementary Material.

Algorithm 1 Inference Algorithm.
Input: X
Initialize: initialize Z and {Y^d}_{d=1}^D.
1: for each iteration do
2:   Update Z given {Y^d}_{d=1}^D.
3:   for d = 1, ..., D do
4:     Sample B^d given Z and Y^d according to (8).
5:     Sample Y^d given X, Z and B^d (as shown in the Supplementary Material).
6:     Sample Ψ^d if needed (as shown in the Supplementary Material).
7:   end for
8: end for
Output: Z, {B^d}_{d=1}^D and {Ψ^d}_{d=1}^D.

Second, we need to sample the weighting factors in B^d, which is a K × R_d matrix in the case of categorical attributes, and a K-length column vector otherwise. We denote each column vector in B^d by b_r^d. The posterior over the weighting vectors is given by

    p(b_r^d | y_r^d, Z) = N(b_r^d | P⁻¹ λ_r^d, P⁻¹),    (8)

where P = Zᵀ Z + (1/σ_B^2) I_K and λ_r^d = Zᵀ y_r^d. Note that the covariance matrix P⁻¹ depends neither on the dimension d nor on r, so we only need to invert the K × K matrix P once at each iteration. We describe in the Supplementary Material how to efficiently compute P after changes in the Z matrix by rank-one updates, without the need of computing the matrix product Zᵀ Z.

Once we have updated Z and B^d, we sample each element in Y^d from the distribution N(y_nr^d | z_n b^d, σ_y^2) if the observation x_n^d is missing, and from the posterior p(y_nr^d | x_n^d, z_n, b^d) specified in the Supplementary Material otherwise. 
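The sampling of B^d given Z and Y^d in Algorithm 1 draws each column b_r^d from the Gaussian posterior of Eq. (8). Since P = ZᵀZ + (1/σ_B²)I_K is shared across all dimensions and columns, it can be factorized once per iteration. A sketch under our own naming (the rank-one updates of the Supplementary Material are omitted here):

```python
import numpy as np

def sample_weights(Z, Y, sigma_B=1.0, rng=None):
    """Draw the columns of B^d from Eq. (8): b_r ~ N(P^{-1} Z^T y_r, P^{-1})
    with P = Z^T Z + (1/sigma_B^2) I.  Each column of Y is one y_r; P is
    shared by all columns, so it is factorized only once."""
    rng = np.random.default_rng(rng)
    K = Z.shape[1]
    P = Z.T @ Z + np.eye(K) / sigma_B**2
    L = np.linalg.cholesky(P)                     # P = L L^T
    mean = np.linalg.solve(P, Z.T @ Y)            # K x R posterior means
    # if eps ~ N(0, I), then solve(L^T, eps) has covariance P^{-1}
    eps = rng.standard_normal(mean.shape)
    return mean + np.linalg.solve(L.T, eps)
```

The draw uses the standard fact that, for a Cholesky factor L of P, the vector L⁻ᵀε with ε ~ N(0, I) has covariance (L Lᵀ)⁻¹ = P⁻¹.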
Finally, we sample the auxiliary variables in Ψ^d from their posterior distribution (detailed in the Supplementary Material) if necessary. These two latter steps involve, in the worst case, sampling from a doubly truncated univariate normal distribution (see the Supplementary Material for further details), for which we make use of the algorithm in [11].

5 Experimental Evaluation

We now validate the proposed algorithm for table completion on five real databases, which are summarized in Table 1. The datasets contain different numbers of instances and attributes, which cover all the discrete and continuous variables described in Section 3. We compare, in terms of predictive log-likelihood, the following methods for table completion:

• The proposed general table completion approach, denoted by GIBP (detailed in Section 3).
• The standard linear-Gaussian IBP [8], denoted by SIBP, which treats all the attributes as Gaussian.
• The Bayesian probabilistic matrix factorization approach [15], denoted by BPMF, which also treats all the attributes in X as Gaussian distributed.

For the GIBP, we consider for the positive real-valued and the count data the following transformation, which maps from the real numbers to the positive real numbers: f(x) = log(exp(wx) + 1), where w is a user hyper-parameter. Before running the SIBP and the BPMF methods, we normalize each column in matrix X to have zero mean and unit variance. Then, in order to provide estimates for the missing data, we denormalize the inferred Gaussian variable. 
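The transformation f(x) = log(exp(wx) + 1) above and its inverse f⁻¹(x) = log(exp(x) − 1)/w, which the likelihoods (1) and (7) evaluate at the observed values, overflow if coded literally for large arguments. A numerically stable sketch (the rewrites are the standard log1p/expm1 identities, not taken from the released code):

```python
import numpy as np

def f(x, w=1.0):
    """Softplus map R -> R+: f(x) = log(exp(w x) + 1), computed without
    overflow via log(exp(t) + 1) = max(t, 0) + log1p(exp(-|t|))."""
    t = w * np.asarray(x, dtype=float)
    return np.maximum(t, 0.0) + np.log1p(np.exp(-np.abs(t)))

def f_inv(x, w=1.0):
    """Inverse map R+ -> R: f_inv(x) = log(exp(x) - 1) / w, rewritten as
    (x + log(1 - exp(-x))) / w so that exp(x) is never formed."""
    x = np.asarray(x, dtype=float)
    return (x + np.log(-np.expm1(-x))) / w
```

With these forms, f remains finite for arguments of order 10³ and f_inv(f(v)) returns v to floating-point accuracy across the usable range.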
Additionally, since both the SIBP and the BPMF assume continuous observations, when dealing with discrete data we estimate each missing value as the closest integer to the (denormalized) Gaussian variable.

Dataset | N | D | Description
Statlog German credit dataset [5] | 1,000 | 20 (10 C + 4 O + 6 N) | Collects information about the credit risks of the applicants.
QSAR biodegradation dataset [10] | 1,055 | 41 (2 R + 17 P + 4 C + 18 N) | Contains molecular descriptors of biodegradable and non-biodegradable chemicals.
Internet usage survey dataset [1] | 1,006 | 32 (23 C + 8 O + 1 N) | Contains the responses of the participants to a survey related to the usage of internet.
Wine quality dataset [3] | 6,497 | 12 (11 P + 1 N) | Contains the results of physicochemical tests realized to different wines.
NESARC dataset [13] | 43,000 | 55 (55 C) | Contains the responses of the participants to a survey related to personality disorders.

Table 1: Description of datasets. 'R' stands for real-valued variables, 'P' for positive real-valued variables, 'C' for categorical variables, 'O' for ordinal variables and 'N' for count variables.

(a) Statlog. (b) QSAR biodegradation. (c) Internet usage survey. (d) Wine quality. (e) NESARC database.

Figure 2: Average test log-likelihood per missing datum. The 'whiskers' show one standard deviation from the average test log-likelihood.

In Figure 2, we plot the average predictive log-likelihood per missing value as a function of the percentage of missing data. Each value in Figure 2 has been obtained by averaging the results over 20 independent sets in which the missing values have been randomly chosen. In Figures 2a and 2b, we cut the plot at 50% because, in these two databases, the discrete attributes present a mode value that accounts for more than 80% of the instances. 
As a consequence, the SIBP and the BPMF algorithms assign probability close to one to the mode, which results in an artificial increase in the average test log-likelihood for larger percentages of missing data. For the BPMF model, we have used different numbers of latent features (in particular, 10, 20 and 50), although we only show the best results for each database: specifically, K = 10 for the NESARC and the wine databases, and K = 50 for the remainder. Neither the GIBP nor the SIBP inferred more than 25 (binary) latent features in any case. Note that in Figure 2e we only plot the test log-likelihood for the GIBP and the SIBP, because the BPMF provides much lower values. As expected, we observe in Figure 2 that the average test log-likelihood decreases for the three models as the number of missing values increases (the flat shape of the curves is due to the y-axis scale). In this figure, we also observe that the proposed general IBP model outperforms the SIBP and the BPMF on four of the databases, with the SIBP being slightly better on the Internet database. The BPMF model presents the lowest test log-likelihood in all the databases.

Now, we analyze the performance of the three models for each kind of discrete and continuous variable. Figure 3 shows the average predictive likelihood per missing value for each attribute in the table, i.e., for each dimension in X. In this figure we have grouped the dimensions according to the kind of data that they contain, showing on the x-axis the number of categories considered for categorical and ordinal data. 
In this figure, we observe that the GIBP presents similar performance for all the attributes in the five databases, while for the SIBP and the BPMF models the test log-likelihood falls drastically for some of the attributes, this effect being worse in the case of the BPMF (which explains its low log-likelihood in Figure 2). This effect is even more evident in Figures 2b and 2d. We also observe, in Figures 2 and 3, that both IBP-based approaches (the GIBP and the SIBP) outperform the BPMF, with the proposed GIBP being the one that performs best across all the databases. We can conclude that, unlike the SIBP and the BPMF, the GIBP provides accurate estimates for the missing data regardless of their discrete or continuous nature.

6 Conclusions

In this paper, we have proposed a table completion approach for heterogeneous databases, based on an IBP with a generalized likelihood that allows for mixed discrete and continuous data. We have then derived an inference algorithm that scales linearly with the number of observations. Finally, our experimental results over five real databases have shown that the proposed approach outperforms, in terms of robustness and accuracy, approaches that treat all the attributes as Gaussian variables.

(a) Statlog. (b) QSAR biodegradation. (c) Internet usage survey. (d) Wine quality. (e) NESARC database.

Figure 3: Average test log-likelihood per missing datum in each dimension of the data with 50% of missing data. 
On the x-axis, 'R' stands for real-valued variables, 'P' for positive real-valued variables, 'C' for categorical variables, 'O' for ordinal variables and 'N' for count variables. The number that accompanies the 'C' or 'O' corresponds to the number of categories.

Acknowledgments

Isabel Valera acknowledges the support of Plan Regional-Programas I+D of Comunidad de Madrid (AGES-CM S2010/BMD-2422) and Ministerio de Ciencia e Innovación of Spain (project DEIPRO TEC2009-14504-C02-00 and program Consolider-Ingenio 2010 CSD2008-00010 COMONSENS). Zoubin Ghahramani is supported by the EPSRC grant EP/I036575/1 and a Google Focused Research Award.

References

[1] Pew Research Centre. 25th anniversary of the web. Available on: http://www.pewinternet.org/datasets/january-2014-25th-anniversary-of-the-web-omnibus/.

[2] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. J. Mach. Learn. Res., 6:1019–1041, December 2005.

[3] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009. Dataset available on: http://archive.ics.uci.edu/ml/datasets.html.

[4] F. Doshi-Velez and Z. Ghahramani. Accelerated sampling for the Indian buffet process. 
In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 273–280, New York, NY, USA, 2009. ACM.

[5] J. Eggermont, J. N. Kok, and W. A. Kosters. Genetic programming for data classification: Partitioning the search space. In Proceedings of the 2004 Symposium on Applied Computing (ACM SAC '04), pages 1001–1005. ACM, 2004. Dataset available on: http://archive.ics.uci.edu/ml/datasets.html.

[6] M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18, 2006.

[7] P. Gopalan, F. J. R. Ruiz, R. Ranganath, and D. M. Blei. Bayesian nonparametric Poisson factorization for recommendation systems. International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.

[8] T. L. Griffiths and Z. Ghahramani. The Indian buffet process: an introduction and review. Journal of Machine Learning Research, 12:1185–1224, 2011.

[9] X.-B. Li. A Bayesian approach for estimating and replacing missing categorical data. J. Data and Information Quality, 1(1):3:1–3:11, June 2009.

[10] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, and V. Consonni. Quantitative structure-activity relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling. Dataset available on: http://archive.ics.uci.edu/ml/datasets.html.

[11] C. P. Robert. Simulation of truncated normal variables. Statistics and Computing, 5(2):121–125, 1995.

[12] F. J. R. Ruiz, I. Valera, C. Blanco, and F. Perez-Cruz. Bayesian nonparametric modeling of suicide attempts. Advances in Neural Information Processing Systems, 25:1862–1870, 2012.

[13] F. J. R. Ruiz, I. Valera, C. Blanco, and F. Perez-Cruz. Bayesian nonparametric comorbidity analysis of psychiatric disorders. Journal of Machine Learning Research (To appear). Available on: http://arxiv.org/pdf/1401.7620v1.pdf, 2013.

[14] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2007.

[15] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 880–887, New York, NY, USA, 2008. ACM.

[16] E. Salazar, M. Cain, E. Darling, S. Mitroff, and L. Carin. Inferring latent structure from mixed real and categorical relational data. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1039–1046, New York, NY, USA, July 2012. Omnipress.

[17] ScienceDaily. Big data, for better or worse: 90% of world's data generated over last two years.

[18] P. Shafto, C. Kemp, V. Mansinghka, and J. B. Tenenbaum. A probabilistic model of cross-categorization. Cognition, 120(1):1–25, 2011.

[19] S. Singh and T. Graepel. Automated probabilistic modelling for relational data. In Proceedings of the ACM International Conference on Information and Knowledge Management, CIKM '13, New York, NY, USA, 2013. ACM.

[20] M. Titsias. The infinite gamma-Poisson feature model. Advances in Neural Information Processing Systems, 19, 2007.

[21] A. Todeschini, F. Caron, and M. Chavent. Probabilistic low-rank matrix completion with adaptive spectral regularization algorithms. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 845–853. Curran Associates, Inc., Dec. 2013.

[22] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 448–456, New York, NY, USA, 2011.
ACM.

[23] S. Williamson, C. Wang, K. Heller, and D. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. Proceedings of the 27th Annual International Conference on Machine Learning, 2010.
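As a side note on evaluation, the per-attribute metric reported in Figure 3 (average test log-likelihood per missing datum in each dimension) can be sketched as follows for a real-valued attribute under a Gaussian predictive distribution. This is a minimal illustration of the metric only, not the authors' code: the function name and array layout are assumptions, and a discrete attribute would use the matching categorical, ordinal, or count likelihood instead of the Gaussian density.

```python
import math

def avg_test_loglik_per_attribute(y_true, mu, sigma2, missing_mask):
    """Average Gaussian test log-likelihood per missing datum, per attribute.

    y_true, mu, sigma2: nested lists of shape (objects x attributes) holding
    the held-out values and the model's predictive means and variances.
    missing_mask marks which entries were held out.  Returns one average
    log-likelihood per attribute (column), as plotted in Figure 3.
    """
    n_attributes = len(y_true[0])
    averages = []
    for d in range(n_attributes):
        total, count = 0.0, 0
        for i in range(len(y_true)):
            if missing_mask[i][d]:
                v = sigma2[i][d]
                # Gaussian log-density of the held-out value under the
                # predictive distribution N(mu, v).
                total += (-0.5 * math.log(2 * math.pi * v)
                          - (y_true[i][d] - mu[i][d]) ** 2 / (2 * v))
                count += 1
        averages.append(total / count if count else float('nan'))
    return averages
```

Averaging per attribute (rather than over all missing entries jointly) is what exposes the failure mode discussed above: a model that treats every attribute as Gaussian can score well on the continuous dimensions while its log-likelihood collapses on the discrete ones.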