{"title": "Latent Variable Models for Predicting File Dependencies in Large-Scale Software Development", "book": "Advances in Neural Information Processing Systems", "page_first": 865, "page_last": 873, "abstract": "When software developers modify one or more files in a large code base, they must also identify and update other related files. Many file dependencies can be detected by mining the development history of the code base: in essence, groups of related files are revealed by the logs of previous workflows. From data of this form, we show how to detect dependent files by solving a problem in binary matrix completion. We explore different latent variable models (LVMs) for this problem, including Bernoulli mixture models, exponential family PCA, restricted Boltzmann machines, and fully Bayesian approaches. We evaluate these models on the development histories of three large, open-source software systems: Mozilla Firefox, Eclipse Subversive, and Gimp. In all of these applications, we find that LVMs improve the performance of related file prediction over current leading methods.", "full_text": "Latent Variable Models for Predicting File\n\nDependencies in Large-Scale Software Development\n\nDiane J. Hu1, Laurens van der Maaten1,2, Youngmin Cho1, Lawrence K. Saul1, Sorin Lerner1\n\n1Dept. of Computer Science & Engineering, University of California, San Diego\n\n2Pattern Recognition & Bioinformatics Lab, Delft University of Technology\n\n{dhu,lvdmaaten,yoc002,saul,lerner}@cs.ucsd.edu\n\nAbstract\n\nWhen software developers modify one or more \ufb01les in a large code base, they\nmust also identify and update other related \ufb01les. Many \ufb01le dependencies can be\ndetected by mining the development history of the code base: in essence, groups\nof related \ufb01les are revealed by the logs of previous work\ufb02ows. From data of this\nform, we show how to detect dependent \ufb01les by solving a problem in binary matrix\ncompletion. 
We explore different latent variable models (LVMs) for this problem,\nincluding Bernoulli mixture models, exponential family PCA, restricted Boltz-\nmann machines, and fully Bayesian approaches. We evaluate these models on the\ndevelopment histories of three large, open-source software systems: Mozilla Fire-\nfox, Eclipse Subversive, and Gimp. In all of these applications, we \ufb01nd that LVMs\nimprove the performance of related \ufb01le prediction over current leading methods.\n\n1\n\nIntroduction\n\nAs software systems grow in size and complexity, they become more dif\ufb01cult to develop and main-\ntain. Nowadays, it is not uncommon for a code base to contain source \ufb01les in multiple programming\nlanguages, text documents with meta information, XML documents for web interfaces, and even\nplatform-dependent versions of the same application. This complexity creates many challenges be-\ncause no single developer can be an expert in all things.\nOne such challenge arises whenever a developer wishes to update one or more \ufb01les in the code\nbase. Often, seemingly localized changes will require many parts of the code base to be updated.\nUnfortunately, these dependencies can be dif\ufb01cult to detect. Let S denote a set of starter \ufb01les that\nthe developer wishes to modify, and let R denote the set of relevant \ufb01les that require updating after\nmodifying S. In a large system, where the developer cannot possibly be familiar with the entire code\nbase, automated tools that can recommend \ufb01les in R given starter \ufb01les in S are extremely useful.\nA number of automated tools now make recommendations of this sort by mining the development\nhistory of the code base [1, 2]. Work in this area has been facilitated by code versioning systems,\nsuch as CVS or Subversion, which record the development histories of large software projects. 
In\nthese histories, transactions denote sets of \ufb01les that have been jointly modi\ufb01ed\u2014that is, whose\nchanges have been submitted to the code base within a short time interval. Statistical analyses of\npast transactions can reveal which \ufb01les depend on each other and need to be modi\ufb01ed together.\nIn this paper, we explore the use of latent variable models (LVMs) for modeling the development\nhistory of large code bases. We consider a number of different models, including Bernoulli mixture\nmodels, exponential family PCA, restricted Boltzmann machines, and fully Bayesian approaches.\nIn these models, the problem of recommending relevant \ufb01les can be viewed as a problem in binary\nmatrix completion. We present experimental results on the development histories of three large\nopen-source systems: Mozilla Firefox, Eclipse Subversive, and Gimp. In all of these applications,\nwe \ufb01nd that LVMs outperform the current leading method for mining development histories.\n\n1\n\n\f2 Related work\n\nTwo broad classes of methods are used for identifying \ufb01le dependencies in large code bases; one\nanalyzes the semantic content of the code base while the other analyzes its development history.\n\n2.1\n\nImpact analysis\n\nThe \ufb01eld of impact analysis [3] draws on tools from software engineering in order to identify the\nconsequences of code modi\ufb01cations. Most approaches in this tradition attempt to identify program\ndependencies by inspecting and/or running the program itself. Such dependence-based techniques\ninclude transitive traversal of the call graph as well as static [4, 5, 6] and dynamic [7, 8] slicing\ntechniques. These methods can identify many dependencies; however, they have trouble on cer-\ntain dif\ufb01cult cases such as cross-language dependencies (e.g., between a data con\ufb01guration \ufb01le and\nthe code that uses it) and cross-program dependencies (e.g., between the front and back ends of a\ncompiler). 
These dif\ufb01culties have led researchers to explore the methods we consider next.\n\n2.2 Mining of development histories\n\nData-driven methods identify \ufb01le dependencies in large software projects by analyzing their devel-\nopment histories. Two of the most widely recognized works in this area are by Ying et al. [1] and\nZimmerman et al. [2]. Both groups use frequent itemset mining (FIM) [9], a general heuristic for\nidentifying frequent patterns in large databases. The patterns extracted from development histories\nare just those sets of \ufb01les that have been jointly modi\ufb01ed at some point in the past; the frequent\npatterns are the patterns that have occurred at least \u03c4 times. The parameter \u03c4 is called the minimum\nsupport threshold. In practice, it is tuned to yield the best possible balance of precision and recall.\nGiven a database and a minimum support threshold, the resulting set of frequent patterns is uniquely\nspeci\ufb01ed. Much work has been devoted to making FIM as fast and ef\ufb01cient as possible. Ying et\nal. [1] uses a FIM algorithm called FP-growth, which extracts frequent patterns by using a tree-like\ndata structure that is cleverly designed to prune the number of possible patterns to be searched. FP-\ngrowth is used to \ufb01nd all frequent patterns that contain the set of starter \ufb01les; the joint sets of these\nfrequent patterns are then returned as recommendations. As a baseline in our experiments we use a\nvariant of FP-growth called FP-Max [10] which outputs only maximal sets for added ef\ufb01ciency.\nZimmerman et al. [2] uses the popular Apriori algorithm [11] (which uses FIM to solve a subtask) to\nform association rules from the development history. 
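As a concrete illustration of frequent-pattern recommendation, the scheme above can be approximated by naive co-occurrence counting over a toy development history (the file names and threshold below are hypothetical, and real implementations use tree-based miners such as FP-growth/FP-Max rather than this brute-force enumeration):

```python
from collections import Counter

def recommend(history, starters, min_support):
    """Recommend files that co-occur with *all* starter files in at
    least `min_support` past transactions (a brute-force stand-in for
    FP-growth; it scans every transaction instead of mining a tree)."""
    starters = set(starters)
    counts = Counter()
    for transaction in history:
        if starters <= set(transaction):          # transaction contains S
            counts.update(set(transaction) - starters)
    return {f for f, c in counts.items() if c >= min_support}

# Hypothetical transaction log mined from a versioning system.
history = [
    {"parser.c", "parser.h", "ast.c"},
    {"parser.c", "parser.h", "lexer.c"},
    {"parser.c", "parser.h", "ast.c"},
    {"docs.md"},
]
print(recommend(history, {"parser.c"}, min_support=2))
# parser.h co-occurs 3 times and ast.c twice; lexer.c falls below support
```

As in the methods above, raising `min_support` trades recall for precision by discarding rarely co-modified files.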
These rules are of the form x1 \u2192 x2, where\nx1 and x2 are disjoint sets; they indicate that \u201cif x1 is observed, then based on experience, x2 should\nalso be observed.\u201d After identifying all rules in which starter \ufb01les appear on the left hand side, their\ntool recommends all \ufb01les that appear on the right hand side. They also work with content on a \ufb01ner\ngranularity, recommending not only relevant \ufb01les, but also relevant code blocks within \ufb01les.\nBoth Ying et al. [1] and Zimmerman et al. [2] evaluate the data-driven approach by its f-measure, as\nmeasured against \u201cground-truth\u201d recommendations. For Ying et al. [1], these ground-truth recom-\nmendations are the \ufb01les committed for a completed modi\ufb01cation task, as recorded in that project\u2019s\nBugzilla. For Zimmerman et al. [2], the ground-truth recommendations are the \ufb01les checked-in\ntogether at some point in the past, as revealed by the development history.\nOther researchers have also used the development history to detect \ufb01le dependencies, but in\nmarkedly different ways. Shirabad et al. [12] formulate the problem as one of binary classi\ufb01ca-\ntion; they label pairs of source \ufb01les as relevant or non-relevant based on their joint modi\ufb01cation\nhistories. Robillard [13] analyzes the topology of structural dependencies between \ufb01les at the code-\nblock level. Kagdi et al [14] improve on the accuracy of existing \ufb01le recommendation methods\nby considering asymmetric \ufb01le dependencies; this information is also used to return a partial or-\ndering over recommended \ufb01les. Finally, Sherriff et al. [15] identify clusters of dependent \ufb01les by\nperforming singular value decomposition on the development history.\n\n3 Latent variable modeling of development histories\n\nWe examine four latent variable models of \ufb01le dependence in software systems. 
All these models represent the development history as a large N × D binary matrix, where non-zero elements in the same row indicate files that were checked-in together or jointly modified at some point in time. To detect dependent files, we infer the values of missing elements in this matrix from the values of known elements. The inferences are made from the probability distributions defined by each model. We use the following notation for all models:

1. The file list F = (f1, . . . , fD) is an ordered collection of all files referenced in a static version of the development history.

2. A transaction is a set of files that were modified together, according to the development history. We represent each transaction by a D-dimensional binary vector x = (x1, . . . , xD), where xi = 1 if fi is a member of the transaction, and xi = 0 otherwise.

3. A development history D is a set of N transaction vectors {x1, x2, . . . , xN}. We assume them to be independently and identically sampled from some underlying joint distribution.

4. A starter set is a set of s starter files S = (fi1, . . . , fis) that the developer wishes to modify.

5. A recommendation set is a set of recommended files R = (fj1, . . . , fjr) that we label as relevant to the starter set S.

3.1 Bernoulli mixture model

The simplest model that we explore is a Bernoulli mixture model (BMM). Figure 1(a) shows the BMM's graphical model in plate notation. In training, the observed variables are the D binary elements xi ∈ {0, 1} of each transaction vector. The hidden variable is a multinomial label z ∈ {1, 2, . . . 
, k} that can be viewed as assigning each transaction vector to one of k clusters. The joint distribution of the BMM is given by:

p(x, z | π, μ) = p(z|π) ∏_{i=1}^D p(xi | z, μ) = πz ∏_{i=1}^D μiz^{xi} (1 − μiz)^{1−xi}.   (1)

As implied by the graph in Fig. 1(a), we model the different elements of x as conditionally independent given the label z. Here, the parameter πz = p(z|π) denotes the prior probability of the latent variable z, while the parameter μiz = p(xi = 1|z, μ) denotes the conditional mean of the observed variable xi. We use the EM algorithm to estimate parameters that maximize the likelihood p(D|π, μ) = ∏_n p(xn|π, μ) of the transactions in the development history.

When a software developer wishes to modify a set of starter files, she can query a trained BMM to identify a set of relevant files. Let s = {xi1, . . . , xis} denote the elements of the transaction vector indicating the files in the starter set S. Let r denote the D − s remaining elements of the transaction vector indicating files that may or may not be relevant. In BMMs, we infer which files are relevant by computing the posterior probability p(r | s = 1, π, μ). Using Bayes rule and conditional independence, this posterior probability is given (up to a constant factor) by:

p(r | s = 1, π, μ) ∝ ∑_{z=1}^k p(r|z, μ) p(s = 1|z, μ) p(z|π).   (2)

The most likely set of relevant files, according to the model, is given by the completed transaction r* that maximizes the right hand side of eq. (2). Unfortunately, while we can efficiently compute the posterior probability p(r | s = 1) for a particular set of recommended files, it is not straightforward to maximize eq. (2) over all 2^{D−s} possible ways to complete the transaction. 
As an approximation, we\nsort the possibly relevant \ufb01les by their individual posterior probabilities p(xi = 1|s = 1) for fi /\u2208 S.\nThen we recommend all \ufb01les whose posterior probabilities p(xi = 1|s = 1) exceed some threshold;\nwe optimize the threshold on a held-out set of training examples.\n\n3.2 Bayesian Bernoulli mixture model\n\nWe also explore a Bayesian treatment of the BMM. In a Bayesian Bernoulli mixture (BBM), instead\nof learning point estimates of the parameters {\u03c0, \u00b5}, we introduce a prior distribution p(\u03c0, \u00b5) and\nmake predictions by averaging over the posterior distribution p(\u03c0, \u00b5|D). The generative model for\nthe BBM is shown graphically in Figure 1(b).\n\n3\n\n\f(a) BMM.\n\n(b) BBM.\n\n(c) RBM.\n\n(d) Logistic PCA.\n\nFigure 1: Graphical model of the Bernoulli mixture model (BMM), the Bayesian Bernoulli mixture\n(BBM), the restricted Boltzmann machine (RBM), and logistic PCA.\n\nIn our BBMs, the mixture weight parameters are drawn from a Dirichlet prior1:\n\np(\u03c0|\u03b1) = Dirichlet (\u03c0 |\u03b1/k, . . . , \u03b1/k ) ,\n\n(3)\nwhere k indicates (as before) the number of mixture components and \u03b1 is a hyperparameter of the\nDirichlet prior, the so-called concentration parameter2. Likewise, the parameters of the k Bernoulli\ndistributions are drawn from Beta priors:\n\np(\u00b5j|\u03b2, \u03b3) = Beta(\u00b5j|\u03b2, \u03b3),\n\n(4)\n\nwhere \u00b5j is a D-dimensional vector, and \u03b2 and \u03b3 are hyperparameters of the Beta prior.\nAs exact inference in BBMs is intractable, we resort to collapsed Gibbs sampling and make pre-\ndictions by averaging over samples from the posterior. In particular, we integrate out the Bernoulli\nparameters \u00b5 and the cluster distribution parameters \u03c0, and we sample the cluster assignment vari-\nables z. 
For Gibbs sampling, we must compute the conditional probability p(zn = j | z−n, D) that the nth transaction is assigned to cluster j, given the training data D and all other cluster assignments z−n. This probability is given by:

p(zn = j | z−n, D) = (N−nj + α/k)/(N − 1 + α) · ∏_{i=1}^D [ (β + N−nij)^{xni} (γ + N−nj − N−nij)^{1−xni} / (β + γ + N−nj) ],   (5)

where N−nj counts the number of transactions assigned to cluster j (excluding the nth transaction) and N−nij counts the number of times that the ith file belongs to one of these N−nj transactions.

After each full Gibbs sweep, we obtain a sample z(t) (and corresponding counts Nj(t) of the number of points assigned to cluster j), which can be used to infer the Bernoulli parameters μj(t). We use T of these samples to estimate the probability that a file xi needs to be changed given files in the starter set S. In particular, averaging predictions over the T Gibbs samples, we estimate:

p(xi = 1 | s = 1) ≈ (1/T) ∑_{t=1}^T [ (1/N) ∑_{j=1}^k Nj(t) p(xi = 1 | μj(t)) p(s = 1 | μj(t)) ],  with  μj(t) = (1/Nj(t)) ∑_{n: zn(t) = j} xn.   (6)

3.3 Restricted Boltzmann Machines

A restricted Boltzmann machine (RBM) is a Markov random field (MRF) whose nodes are (typically) binary random variables [17]. The graphical model of an RBM is a fully connected bipartite

1 In preliminary experiments, we also investigated an infinite mixture of Bernoulli distributions that replaces the Dirichlet prior by a Dirichlet process [16]. 
However, we did not find the infinite mixture model to outperform its finite counterpart, so we do not discuss it further.

2 For simplicity, we assume a symmetric Dirichlet prior, i.e. we assume ∀j: αj = α/k.

graph with D observed variables xi in one layer and k latent variables yj in the other; see Fig. 1(c). Due to the bipartite structure, the latent variables are conditionally independent given the observed variables (and vice versa). For the RBMs in this paper, we model the joint distribution as:

p(x, y) = (1/Z) exp(−x⊤W y − b⊤x − c⊤y),   (7)

where W stores the weight matrix between layers, b and c store (respectively) the biases on observed and hidden nodes, and Z is a normalization factor that depends on the model's parameters. The product form of RBMs can model much sharper distributions over the observed variables than mixture models [17], making them an interesting alternative to consider for our application.

RBMs are trained by maximum likelihood estimation. Exact inference in RBMs is intractable due to the exponential sum in the normalization factor Z. However, the conditional distributions required for Gibbs sampling have a particularly simple form:

p(xi = 1 | y) = σ( ∑_j Wij yj + bi ),   (8)

p(yj = 1 | x) = σ( ∑_i Wij xi + cj ),   (9)

where σ(z) = [1 + e−z]−1 is the sigmoid function. The obtained Gibbs samples can be used to approximate the gradient of the likelihood function with respect to the model parameters; see [17, 18] for further discussion of sampling strategies3.

To determine whether a file fi is relevant given starter files in S, we can either (i) clamp the observed variables representing starter files and perform Gibbs sampling on the rest, or (ii) compute the posterior over the remaining files using a fast, factorized approximation [19]. In preliminary experiments, we found the latter to work best. Hence, we recommend files by computing

p(xi = 1 | s = 1) ∝ exp(bi) ∏_{ℓ=1}^k ( 1 + exp{ ∑_{j: fj ∈ S} xj Wjℓ + Wiℓ + cℓ } ),   (10)

then thresholding these probabilities on some value determined on held-out examples.

3.4 Logistic PCA

Logistic PCA is a method for dimensionality reduction of binary data; see Fig. 1(d) for its graphical model. Logistic PCA belongs to a family of algorithms known as exponential family PCA; these algorithms generalize PCA to data modeled by non-Gaussian distributions of the exponential family [20, 21, 22]. To use logistic PCA, we stack the N transaction vectors xn ∈ {0, 1}D of the development history into an N × D binary matrix X. Then, modeling each element of this matrix as a Bernoulli random variable, we attempt to find a low-rank factorization of the N × D real-valued matrix Θ whose elements are the log-odds parameters of these random variables.

The low-rank factorization in logistic PCA is computed by maximizing the log-likelihood of the observed data X. 
In terms of the log-odds matrix Θ, this log-likelihood is given by:

LX(Θ) = ∑_{nd} [ Xnd log σ(Θnd) + (1 − Xnd) log σ(−Θnd) ].   (11)

We obtain a low dimensional representation of the data by factoring the log-odds matrix Θ ∈ R^{N×D} as the product of two smaller matrices U ∈ R^{N×L} and V ∈ R^{L×D}. Specifically, we have:

Θnd = ∑_ℓ Unℓ Vℓd.   (12)

Note that the reduced rank L ≪ D plays a role analogous to the number of clusters k in BMMs.

After obtaining a low-rank factorization of the log-odds matrix Θ = UV, we can use it to recommend relevant files from starter files S = {fi1, fi2, . . . , fis}. To recommend relevant files, we compute the vector u that optimizes the regularized log-loss:

LS(u) = − ∑_{j=1}^s log σ(u · vij) + (λ/2) ‖u‖²,   (13)

where in the first term, vℓ denotes the ℓth column of the matrix V, and in the second term, λ is a regularization parameter.

3 We use the approach in [17] known as contrastive divergence with m Gibbs sweeps (CD-m).

Table 1: Dataset statistics, showing the time period from which transactions were extracted, and the number of transactions and unique files in the training and test sets (for a single starter file).

          | Mozilla Firefox        | Eclipse Subversive     | Gimp
          | March 2007 - Nov 2007  | Dec 2006 - May 2010    | Nov 2007 - May 2010
  Support | Files   Train   Test   | Files   Train   Test   | Files   Train   Test
  10      | 1,264   9,579   2,666  | 61      372     114    | 1,376   5,359   3,608
  15      | 778     9,015   2,266  | 38      316     92     | 899     5,084   3,436
  20      | 546     8,497   1,991  | 30      282     79     | 600     4,729   3,208
  25      | 411     8,021   1,771  | 25      233     59     | 447     4,469   3,012

The vector u obtained in this way is the low dimensional representation\nof the transaction with starter \ufb01les in S. To determine whether \ufb01le fi is relevant, we compute the\nprobability p(xi = 1|u, V) = \u03c3(u\u00b7 vi) and recommend the \ufb01le if this probability exceeds some\nthreshold. (We tune the threshold on held-out transactions from the development history).\n\n4 Experiments\n\nWe evaluated our models on three datasets4 constructed from check-in records of Mozilla Firefox,\nEclipse Subversive, and Gimp. These open-source projects use software con\ufb01guration management\n(SCM) tools which provide logs that allow us to extract binary vectors indicating which \ufb01les were\nchanged during a transaction. Our experimental setup and results are described below.\n\n4.1 Experimental setup\n\nWe preprocess the raw data obtained from SCM\u2019s check-in records in two steps. First, follow-\ning Ying et al [1], we eliminate all transactions consisting of more than 100 \ufb01les (as these usually do\nnot correspond to meaningful changes). Second, we simulate the minimum support threshold (see\nSection 2.2) by removing all \ufb01les in the code base that occur very infrequently. This pruning allows\nus to make a fair comparison with latent variable models (LVMs).\nAfter pre-processing, the dataset is chronologically ordered; the \ufb01rst two-thirds is used as training\ndata, and the last one-third as testing data. For each transaction in the test set, we formed a \u201cquery\u201d\nand \u201clabel\u201d set by randomly picking a set of changed \ufb01les as starter \ufb01les. 
The remaining \ufb01les that\nwere changed in the transaction form the label set, which is the set of \ufb01les our models must predict.\nFollowing [1], we only include transactions for which the label set is non-empty in the train data.\nTable 1 shows the number of transactions for training and test set, as well as the total number of\nunique \ufb01les that appear in these transactions.\nWe trained the LVMs as follows. The Bernoulli mixture models (BMMs) were trained by 100\nor fewer iterations of the EM algorithm. For the Bayesian mixtures (BBMs), we ran 30 separate\nMarkov chains and made predictions after 30 full Gibbs sweeps5. The RBMs were trained for 300\niterations of contrastive divergence (CD), starting with CD-1 and gradually increasing the number\nof Gibbs sweeps to CD-9 [17]. The parameters U and V of logistic PCA were learned using an\nalternating least squares procedure [21] that converges to a local maximum of the log-likelihood.\nWe initialized the matrices U and V from an SVD of the matrix X.\nThe parameters of the LVMs (i.e., number of hidden components in the BMM and RBM, as well\nas the number of dimensions and the regularization parameter \u03bb in logistic PCA) were selected\nbased on the performance on a small held-out validation set. The hyperparameters of the Bayesian\nBernoulli mixtures were set based on prior knowledge from the domain: the Beta-prior parameters \u03b2\nand \u03b3 were set to 0.005 and 0.95, respectively, to re\ufb02ect our prior knowledge that most \ufb01les are not\nchanged in a transaction. 
The concentration parameter \u03b1 was set to 50 to re\ufb02ect our prior knowledge\nthat \ufb01le dependencies typically form a large number of small clusters.\n\n4These binary datasets publicly available at http://cseweb.ucsd.edu/\u223cdhu/research/msr\n5In preliminary experiments, we found 30 Gibbs sweeps to be suf\ufb01cient for the Markov chain to mix.\n\n6\n\n\fModel Support\n\nStart = 1\n\nStart = 3\n\nMozilla Firefox\n\nEclipse Subversive\n\nStart = 1\n\nStart = 3\n\nGimp\n\nStart = 1\n\nStart = 3\n\nFIM\n\nBMM\n\nBBM\n\nRBM\n\nLPCA\n\n10\n15\n20\n25\n10\n15\n20\n25\n10\n15\n20\n25\n10\n15\n20\n25\n10\n15\n20\n25\n\n0.020 0.116 0.016 0.176\n0.133 0.382 0.234 0.516\n0.106 0.136 0.112 0.195\n0.014 0.091 0.016 0.159\n0.141 0.461 0.319 0.632\n0.129 0.144 0.127 0.194\n0.007 0.066 0.013 0.129\n0.177 0.550 0.364 0.672\n0.115 0.137 0.106 0.186\n0.006 0.057 0.010 0.095\n0.227 0.616 0.360 0.637\n0.124 0.135 0.110 0.195\n0.222 0.433 0.206 0.479\n0.129 0.177 0.084 0.152\n0.160 0.189 0.106 0.158\n0.181 0.486 0.350 0.489 0.134 0.205 0.085 0.143\n0.160 0.202 0.110 0.141\n0.127 0.207 0.085 0.154\n0.196 0.530 0.403 0.514\n0.172 0.204 0.120 0.147\n0.251 0.566 0.382 0.482\n0.117 0.212 0.010 0.131\n0.177 0.218 0.130 0.160\n0.114 0.174 0.104 0.177\n0.257 0.547 0.278 0.700\n0.196 0.325 0.180 0.376\n0.192 0.340 0.180 0.376\n0.202 0.607 0.374 0.769\n0.114 0.200 0.107 0.183\n0.206 0.355 0.191 0.417 0.223 0.655 0.413 0.791 0.114 0.205 0.108 0.187\n0.110 0.206 0.103 0.179\n0.197 0.360 0.175 0.391 0.262 0.694 0.418 0.756\n0.157 0.230 0.069 0.307\n0.170 0.233 0.090 0.405\n0.074 0.137 0.028 0.194\n0.080 0.148 0.024 0.205\n0.157 0.238 0.138 0.423\n0.156 0.246 0.063 0.310\n0.074 0.156 0.027 0.242\n0.174 0.307 0.178 0.531\n0.169 0.260 0.058 0.324\n0.062 0.143 0.025 0.230\n0.200 0.426 0.259 0.524\n0.172 0.269 0.088 0.340\n0.200 0.249 0.169 0.300\n0.124 0.415 0.230 0.609\n0.123 0.187 0.148 0.263\n0.124 0.200 0.145 0.288\n0.138 0.452 0.281 0.615\n0.182 0.254 0.157 0.295\n0.115 0.222 0.135 
0.300\n0.212 0.517 0.325 0.667\n0.182 0.265 0.156 0.308\n0.174 0.277 0.162 0.325\n0.247 0.605 0.344 0.625\n0.100 0.205 0.131 0.230\n\nTable 2: Performance of FIM and LVMs on three datasets for queries with 1 or 3 starter \ufb01les. Each\nshaded column presents the f-measure, and each white column presents the correct prediction ratio.\n\n4.2 Results\n\nOur experiments evaluated the performance of each LVM, as well as a highly ef\ufb01cient implemen-\ntation of FIM called FP-Max [10]. Several experiments were run on different values of starter \ufb01les\n(abbreviated \u201cStart\u201d) and minimum support thresholds (abbreviated \u201cSupport\u201d). Table 2 shows the\ncomparison of each model in terms of the f-measure (the harmonic mean of the precision and re-\ncall) and the \u201ccorrect prediction ratio,\u201d or CPR (the fraction of \ufb01les we predict correctly, assuming\nthat the number of \ufb01les to be predicted is given). The latter measure re\ufb02ects how well our models\nidentify relevant \ufb01les for a particular starter \ufb01le, without the added complication of thresholding.\nExperiments that achieve the highest result for each of the two measures are boldfaced.\nFrom our results, we see that most LVMs outperform the popular FIM approach. In particular, the\nBBMs outperform all other approaches on two of the three datasets, with a high of CPR = 79% in\nEclipse Subversive. This means that an average of 79% of all dependent \ufb01les are detected as relevant\nby the BBM. We also observe that f-measure generally decreases with the addition of starter \ufb01les\n\u2013 since the average size of transactions is relatively small (around four \ufb01les for Firefox), adding\nstarter \ufb01les must make predictions less obvious in the case that the total number of relevant \ufb01les is\nnot given to us. Increasing support, on the other hand, seems to effectively remove noise caused by\ninfrequent \ufb01les. 
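For reference, the two scores reported in Table 2 can be computed per query from a recommendation set and its ground-truth label set as follows (the CPR computation assumes, as above, that the model is told how many files to predict and scores the top-ranked files):

```python
def f_measure(recommended, relevant):
    """Harmonic mean of precision and recall for one query."""
    tp = len(set(recommended) & set(relevant))
    if tp == 0:
        return 0.0
    precision = tp / len(set(recommended))
    recall = tp / len(set(relevant))
    return 2 * precision * recall / (precision + recall)

def correct_prediction_ratio(ranked, relevant):
    """Fraction of the top-|relevant| ranked files that are correct,
    i.e. the number of files to predict is given."""
    top = set(ranked[:len(relevant)])
    return len(top & set(relevant)) / len(relevant)

ranked = ["a.h", "b.c", "c.c"]          # files sorted by model score
relevant = {"a.h", "c.c"}               # ground-truth label set
print(f_measure(ranked[:2], relevant))              # thresholded top 2
print(correct_prediction_ratio(ranked, relevant))
```

The f-measure depends on the chosen threshold, while CPR isolates ranking quality, which is why the two columns in Table 2 can disagree.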
Finally, we see that recommendations are most accurate on Eclipse Subversive, the\nsmallest dataset. We believe this is because a smaller test set does not require a model to predict as\nfar into the future as a larger one. Thus, our results suggest that an online learning algorithm may\nfurther increase accuracy.\n\n5 Discussion\n\nThe use of LVMs has signi\ufb01cant advantages over traditional approaches to impact analysis (see\nSection 2), namely its ability to \ufb01nd dependent \ufb01les written in different languages. To show this, we\npresent the three clusters with the highest weights, as discovered by a BMM in the Firefox data, in\nTable 3. The table reveals that the clusters correspond to interpretable structure in the code that span\nmultiple data formats and languages. The \ufb01rst cluster deals with the JIT compiler for JavaScript,\nwhile the second and third deal with the CSS style sheet manager and web browser properties. The\ndependencies in the last two clusters would have been missed by conventional impact analysis.\n\n7\n\n\fCluster 1\njs/src/jscntxt.h\njs/src/jstracer.cpp\njs/src/nanojit/Assembler.cpp\njs/src/jsregexp.cpp\njs/src/jsapi.cpp\njs/src/jsarray.cpp\njs/src/jsfun.cpp\njs/src/jsinterp.cpp\njs/src/jsnum.cpp\njs/src/jsobj.cpp\n\nCluster 2\nview/src/nsViewManager.cpp\nlayout/generic/nsHTMLRe\ufb02owState.cpp\nlayout/reftests/bugs/reftest.list\nlayout/style/nsCSSRuleProcessor.cpp\nlayout/style/nsCSSStyleSheet.cpp\nlayout/style/nsCSSParser.cpp\nlayout/base/crashtests/crashtests.list\nlayout/base/nsBidiPresUtils.cpp\nlayout/base/nsPresShell.cpp\ncontent/xbl/src/nsBindingManager.cpp\n\nCluster 
3\nbrowser/base/content/browser-context.inc\nbrowser/base/content/browser.js\nbrowser/base/content/pageinfo/pageInfo.xul\nbrowser/locales/en-US/chrome/browser/browser.dtd\ntoolkit/mozapps/update/src/nsUpdateService.js.in\ntoolkit/mozapps/update/src/updater/updater.cpp\nmodules/plugin/base/src/nsNPAPIPluginInstance.h\nmodules/plugin/base/src/nsPluginHost.cpp\nbrowser/locales/en-US/chrome/browser/browser.properties\nview/src/nsViewManager.cpp\n\nTable 3: Three of the clusters from Firefox, identi\ufb01ed by the BMM. We show the clusters with\nthe largest mixing proportion. Within each cluster, the 10 \ufb01les with highest membership probabil-\nities are shown; note how these \ufb01les span multiple data formats and program languages, revealing\ndependencies that would escape the notice of traditional methods.\n\nLVMs also have important advantages over FIM. Given a set S of starter \ufb01les, FIM simply looks at\nco-occurrence data; it recommends a set of \ufb01les R for which the number of transactions that contain\nboth R and S is frequent. By contrast, LVMs can exploit higher-order information by discovering\nthe underlying structure of the data. Our results suggest that the ability to leverage such structure\nleads to better predictions. Admittedly, in terms of computation, LVMs have a larger one-time\ntraining cost than the FIM, as we must \ufb01rst train the model or generate and store the Gibbs samples.\nHowever, for a single query, the time required to compute recommendations is comparable to that\nof the FP-Max algorithm we used for FIM.\nThe results from the previous section also revealed signi\ufb01cant differences between the LVMs we\nconsidered. In the majority of our experiments, mixture models (with many mixture components)\nappear to outperform RBMs and logistic PCA. This result suggests that our dataset consists of a large\nnumber of transactions with a number of small, highly interrelated \ufb01les. 
Modeling such data with a product of experts such as an RBM is difficult because each individual expert can "veto" a prediction. We tried to resolve this problem by placing a sparsity prior on the states of the hidden units y to make the RBM behave more like a mixture model [23], but in preliminary experiments this did not improve performance. Another interesting observation is that the Bayesian treatment of the Bernoulli mixture model generally leads to better predictions than a maximum likelihood approach, as it is less susceptible to overfitting. This advantage is particularly important in file dependency prediction, where a large number of mixture components is needed to model data consisting of many small, distinct clusters, yet relatively few training instances (i.e., transactions) are available.

6 Conclusion

In this paper, we have described a new application of binary matrix completion for predicting file dependencies in software projects. For this application, we investigated the performance of four different LVMs and compared our results to those of the widely used FIM approach. Our results indicate that LVMs can significantly outperform FIM by exploiting latent, higher-order structure in the data.
Admittedly, our present study is still limited in scope, and it is very likely that our results can be further improved. For instance, results from the Netflix competition have shown that blending the predictions of several models often leads to better performance [24]. The raw transactions also contain additional information that could be harvested to make more accurate predictions. Such information includes the identity of the users who committed transactions to the code base, as well as the text of the actual changes to the source code.
It remains a grand challenge to incorporate all the available information from development histories into a probabilistic model for predicting which files need to be modified. In future work, we aim to explore discriminative methods for parameter estimation, as well as online algorithms for tracking non-stationary trends in the code base.

Acknowledgments

LvdM acknowledges support by the Netherlands Organisation for Scientific Research (grant no. 680.50.0908) and by the EU-FP7 NoE on Social Signal Processing (SSPNet).

References
[1] A.T.T. Ying, G.C. Murphy, R. Ng, and M.C. Chu-Carroll. Predicting source code changes by mining change history. IEEE Transactions on Software Engineering, 30(9):574-586, 2004.
[2] T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. In Proceedings of the 26th International Conference on Software Engineering, pages 563-572, 2004.
[3] R. Arnold and S. Bohner. Software Change Impact Analysis. IEEE Computer Society, 1996.
[4] M. Weiser. Program slicing. In Proceedings of the 5th International Conference on Software Engineering, pages 439-449, 1981.
[5] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems, 12(1):26-60, 1990.
[6] F. Tip. A survey of program slicing techniques. Journal of Programming Languages, 3:121-189, 1995.
[7] B. Korel and J. Laski. Dynamic program slicing. Information Processing Letters, 29(3):155-163, 1988.
[8] X. Zhang, R. Gupta, and Y. Zhang. Precise dynamic slicing algorithms. In Proceedings of the 25th International Conference on Software Engineering, pages 319-329, 2003.
[9] B. Goethals. Frequent set mining. In The Data Mining and Knowledge Discovery Handbook, pages 377-397, 2005.
[10] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets.
In Proceedings of the 1st ICDM Workshop on Frequent Itemset Mining Implementations, 2003.
[11] M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997.
[12] J.S. Shirabad, T.C. Lethbridge, and S. Matwin. Mining the maintenance history of a legacy software system. In Proceedings of the 19th International Conference on Software Maintenance, pages 95-104, 2003.
[13] M. Robillard. Automatic generation of suggestions for program investigation. ACM SIGSOFT International Symposium on Foundations of Software Engineering, 30:11-20, 2005.
[14] H. Kagdi, S. Yusaf, and J.I. Maletic. Mining sequences of changed-files from version histories. In Proceedings of the International Workshop on Mining Software Repositories, pages 47-53, 2006.
[15] M. Sherriff, J.M. Lake, and L. Williams. Empirical software change impact analysis using singular value decomposition. In International Conference on Software Testing, Verification, and Validation, 2008.
[16] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.
[17] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[18] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064-1071, 2008.
[19] R.R. Salakhutdinov, A. Mnih, and G.E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[20] M. Collins, S. Dasgupta, and R.E. Schapire. A generalization of principal components analysis to the exponential family. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[21] A.I. Schein, L.K. Saul, and L.H. Ungar. A generalized linear model for principal component analysis of binary data. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, 2003.
[22] I. Rish, G. Grabarnik, G. Cecchi, F. Pereira, and G.J. Gordon. Closed-form supervised dimensionality reduction with generalized linear models. In Proceedings of the 25th International Conference on Machine Learning, pages 832-839, 2008.
[23] M.A. Ranzato, Y.L. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems, pages 1185-1192, 2008.
[24] R.M. Bell and Y. Koren. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75-79, 2007.