{"title": "Localizing Bugs in Program Executions with Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 468, "page_last": 476, "abstract": "We devise a graphical model that supports the process of debugging software by guiding developers to code that is likely to contain defects. The model is trained using execution traces of passing test runs; it reflects the distribution over transitional patterns of code positions. Given a failing test case, the model determines the least likely transitional pattern in the execution trace. The model is designed such that Bayesian inference has a closed-form solution. We evaluate the Bernoulli graph model on data of the software projects AspectJ and Rhino.", "full_text": "Localizing Bugs in Program Executions\n\nwith Graphical Models\n\nLaura Dietz\n\nMax-Planck Institute for Computer Science\n\nSaarbruecken, Germany\n\ndietz@mpi-inf.mpg.de\n\nAndreas Zeller\n\nSaarland University\n\nSaarbruecken, Germany\n\nzeller@cs.uni-saarland.de\n\nValentin Dallmeier\nSaarland University\n\nSaarbruecken, Germany\n\ndallmeier@cs.uni-saarland.de\n\nTobias Scheffer\nPotsdam University\nPotsdam, Germany\n\nscheffer@cs.uni-potsdam.de\n\nAbstract\n\nWe devise a graphical model that supports the process of debugging software by\nguiding developers to code that is likely to contain defects. The model is trained\nusing execution traces of passing test runs; it re\ufb02ects the distribution over tran-\nsitional patterns of code positions. Given a failing test case, the model deter-\nmines the least likely transitional pattern in the execution trace. The model is\ndesigned such that Bayesian inference has a closed-form solution. We evaluate\nthe Bernoulli graph model on data of the software projects AspectJ and Rhino.\n\n1\n\nIntroduction\n\nIn today\u2019s software projects, two types of source code are developed: product and test code. 
Product code, also referred to as the program, contains all functionality and will be shipped to the customer. The program and its subroutines are supposed to behave according to a specification. The example program in Figure 1 (left) is supposed to always return the value 10. It contains a defect in line number 20, which lets it return a wrong value if the input variable equals five.\nIn addition to product code, developers write test code that consists of small test programs, each testing a single procedure or module for compliance with the specification. For instance, Figure 1 (right) shows three test cases, the second of which reveals the defect. Development environments provide support for running test cases automatically and would report the failure of the second test case. Localizing defects in complex programs is a difficult problem because the failure of a test case confirms only the existence of a defect, not its location.\nWhen a program is executed, its trace through the source code can be recorded. An executed line of source code is identified by a code position s \u2208 S. The stream of code positions forms the trace t of a test case execution. The data that our model analyses consists of a set T of passing test cases t. In addition to the passing tests, we are given a single trace \u00aft of a failing test case. The passing test traces and the trace of the failing case refer to the same code revision; hence, the semantics of each code position remain constant. For the failing test case, the developer is to be provided with a ranking of code positions according to their likelihood of being defective.\nThe semantics of code positions may change across revisions, and modifications of code may impact the distribution of execution patterns in the modified as well as other locations of the code. We focus on the problem of localizing defects within a current code revision. 
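The task input just described\u2014a set T of passing traces and a single failing trace \u00aft over the same code revision\u2014can be pictured with a small container type. A minimal sketch; the names Trace and LocalizationTask are illustrative and not part of the paper:

```python
from dataclasses import dataclass
from typing import List

# A trace is a stream of code positions s ∈ S; strings stand in for
# (file, line) identifiers here.
Trace = List[str]

@dataclass
class LocalizationTask:
    passing: List[Trace]  # the set T of passing test traces
    failing: Trace        # the single failing trace (same code revision)

    def positions(self) -> List[str]:
        """Distinct code positions executed by the failing trace, in
        first-execution order; these are the candidates to be ranked."""
        seen, ordered = set(), []
        for s in self.failing:
            if s not in seen:
                seen.add(s)
                ordered.append(s)
        return ordered
```

A model then assigns each of these positions a defectiveness score and returns them sorted.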
After each defect is localized, the code is typically revised and the semantics of code positions change. Hence, in this setting, we cannot assume that any negative training data\u2014that is, previous failing test cases of the same code revision\u2014are available. For that reason, discriminative models do not lend themselves to our task.\nInstead of representing the results as a ranked list of positions, we envision a tight integration into development environments. For instance, on failure of a test case, the developer could navigate between predicted locations of the defect, starting with the top-ranked positions.\n\nFigure 1: Example with product code (left) and test code (right).\n\nSo far, Tarantula [1] is the standard reference model for localizing defects in execution traces. The authors propose an interface widget for test case results in which a pixel represents a code position. The hue value of the pixel is determined by the number of failing and passing traces that execute this position and correlates with the likelihood that s is faulty [1]. Another approach [2] includes return values and flags for executed code blocks and builds on sensitivity and increase of failure probability. This approach was continued in the Holmes project [3] to include information about executed control flow paths. Andrzejewski et al. [4] extend latent Dirichlet allocation (LDA) [5] to find bug patterns in recorded execution events. Their probabilistic model captures low-signal bug patterns by explaining passing executions from a set of usage topics and failing executions from a mix of usage and bug topics. Since a vast amount of data is to be processed, our approach is designed to not require estimating latent variables during prediction, as is necessary with LDA-based approaches [4].\n\nOutline. Section 2 presents the Bernoulli graph model, a graphical, generative model that explains program executions. 
This section\u2019s main result is the closed-form solution for Bayesian inference of the likelihood of a transitional pattern in a test trace given example execution traces. Furthermore, we discuss how to learn hyperparameters and smoothing coefficients from other revisions, despite the fragile semantics of code positions. In Section 3, reference methods and simpler probabilistic models are detailed. Section 4 reports on the prediction performance of the studied models for the AspectJ and Rhino development projects. Section 5 concludes.\n\n2 Bernoulli Graph Model\n\nThe Bernoulli graph model is a probabilistic model that generates program execution graphs. In contrast to an execution trace, the graph is a representation of an execution that abstracts from the number of iterations over code fragments. The model allows for Bayesian inference of the likelihood of a transition between code positions within an execution, given previously seen executions.\nThe n-gram execution graph Gt = (Vt, Et, Lt) of an execution t connects vertices Vt by edges Et \u2286 Vt \u00d7 Vt. The labeling function Lt : Vt \u2192 S^(n\u22121) injectively maps vertices to (n\u22121)-grams of code positions, where S is the alphabet of code positions.\nIn the bigram execution graph, each vertex v represents a code position Lt(v); each arc (u, v) indicates that code position Lt(v) has been executed directly after code position Lt(u) at least once during the program execution. In n-gram execution graphs, each vertex v represents a fragment Lt(v) = s1 . . . sn\u22121 of consecutively executed statements. Vertices u and v can only be connected by an arc if the fragments overlap in all but the first code position of u and the last code position of v; that is, Lt(u) = s1 . . . sn\u22121 and Lt(v) = s2 . . . sn. Such vertices u and v are connected by an arc if the code positions s1 . . . sn are executed consecutively at least once during the execution. For the example program in Figure 1, the tri-gram execution graph is given in Figure 2.\n\nFigure 1 (product code, left):\n10 /**\n11  * A procedure containing a defect.\n12  *\n13  * @param param an arbitrary parameter.\n14  * @return 10\n15  */\n16 public static int defect (int param) {\n17   int i = 0;\n18   while (i < 10) {\n19     if (param == 5) {\n20       return 100;\n21     }\n22     i++;\n23   }\n24   return i;\n25 }\n\nFigure 1 (test code, right):\npublic static class TestDefect extends TestCase {\n  public void testParam1() {\n    assertEquals(10, defect(1));\n  }\n  /** Failing test case. */\n  public void testParam5() {\n    assertEquals(10, defect(5));\n  }\n  public void testParam10() {\n    assertEquals(10, defect(10));\n  }\n}\n\nFigure 2: Expanding vertex \u201c22 18\u201d in the generation of a tri-gram execution graph corresponding to the trace at the bottom. The graph before expansion is drawn in black; new parts are drawn in solid red.\n\nGenerative process. The Bernoulli graph model generates one graph Gm,t = (Vm,t, Em,t, Lm,t) per execution t and procedure m. The model starts the graph generation with an initial vertex representing a fragment of virtual code positions \u03b5.\nIn each step, it expands a vertex u labeled Lm,t(u) = s1 . . . sn\u22121 that has not yet been expanded; e.g., vertex \u201c22 18\u201d in Figure 2. Expansion proceeds by tossing a coin with parameter \u03c8m,s1...sn for each appended code position sn \u2208 S. If the coin toss outcome is positive, an edge to a vertex v labeled Lm,t(v) = s2 . . . sn is introduced. If Vm,t does not yet include a vertex v with this labeling, it is added at this point. Each vertex is expanded only once. The process terminates if no vertex is left that has been introduced but not yet expanded. Parameters \u03c8m,s1...sn are governed by a Beta distribution with fixed hyperparameters \u03b1\u03c8 and \u03b2\u03c8. In the following we focus on the generation of edges, treating the vertices as observed. 
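The n-gram execution graph defined above can be built from a trace by sliding a window of n code positions over it. A minimal sketch; the \u03b5-padding for the virtual start fragment follows Figure 2, and the function name is illustrative:

```python
def execution_graph(trace, n=2):
    """Build the n-gram execution graph of a trace: vertices are
    (n-1)-grams of code positions, and an arc (u, v) is present iff the
    n consecutive positions s1..sn occur somewhere in the trace, with
    u = s1..s_{n-1} and v = s2..s_n.  Repeated windows collapse to a
    single vertex/arc, abstracting from the number of iterations."""
    pad = ("ε",) * (n - 1)              # virtual start fragment εε...ε
    seq = pad + tuple(trace)
    vertices, edges = {pad}, set()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        u, v = window[:-1], window[1:]
        vertices.update((u, v))
        edges.add((u, v))
    return vertices, edges
```

For the trace of Figure 2 (17 18 19 22 18 19 22 ... 22 18 24), the tri-gram graph contains, e.g., arcs from vertex (22, 18) to (18, 19) and to (18, 24), matching the expansion of vertex \u201c22 18\u201d.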
Figure 3a) shows a factor graph representation of the generative process, and Algorithm 1 defines the generative process in detail.\n\nAlgorithm 1 Generative process of the Bernoulli graph model.\n\nfor all procedures m do\n  for all s1...sn \u2208 (Sm)^n do\n    draw \u03c8m,s1...sn \u223c Beta(\u03b1\u03c8, \u03b2\u03c8).\n  for all executions t do\n    create a new graph Gm,t.\n    add a vertex u labeled \u03b5\u03b5...\u03b5.\n    initialize queue Q = {u}.\n    while queue Q is not empty do\n      dequeue u \u2190 Q, with L(u) = s1 . . . sn\u22121.\n      for all sn \u2208 Sm do\n        let v be a vertex with L(v) = s2 . . . sn.\n        draw b \u223c Bernoulli(\u03c8m,s1...sn).\n        if b = 1 then\n          if v \u2209 Vm,t then\n            add v to Vm,t.\n            enqueue v \u2192 Q.\n          add arc (u, v) to Em,t.\n\nFigure 3: Generative models in directed factor graph notation with dashed rectangles indicating gates [6]: a) Bernoulli graph, b) Bernoulli fragment, c) Multinomial n-gram.\n\nInference. Given a collection Gm of previously seen execution graphs for method m and a new execution Gm = (Vm, Em, Lm), Bayesian inference determines the likelihood p((u, v) \u2208 Em | Vm, Gm, \u03b1\u03c8, \u03b2\u03c8) of each of the edges (u, v), thus indicating unlikely transitions in the new execution of m represented by execution graph Gm. Since we employ independent models for all methods m, inference can be carried out for each method separately. Since the vertices Vm are observed, the coin parameters \u03a8 are d-separated from each other (cf. Figure 3a). We obtain independent Beta-Bernoulli models conditioned on the presence of start vertices u. 
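Because a graph contains a vertex or edge exactly when the corresponding fragment of code positions is a substring of the trace, the Beta-Bernoulli posterior predictive for a transition reduces to counting training traces. A sketch of this counting view, with illustrative names and plain lists standing in for traces (alpha and beta play the role of \u03b1\u03c8 and \u03b2\u03c8):

```python
def transition_prob(next_pos, fragment, traces, alpha=1.0, beta=1.0):
    """Beta-Bernoulli posterior predictive probability of seeing code
    position `next_pos` directly after the fragment of preceding
    positions `fragment`, estimated from passing traces.  Counting
    traces that contain the fragment (resp. fragment + next position)
    as a substring counts the training graphs with the start vertex
    (resp. the edge)."""
    def contains(trace, pattern):
        k = len(pattern)
        return any(tuple(trace[i:i + k]) == tuple(pattern)
                   for i in range(len(trace) - k + 1))
    frag = tuple(fragment)
    n_frag = sum(contains(t, frag) for t in traces)
    n_edge = sum(contains(t, frag + (next_pos,)) for t in traces)
    return (n_edge + alpha) / (n_frag + alpha + beta)
```

Note that no execution graph is materialized; only substring counts over the trace representation are needed.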
Thus, predictive distributions for the presence of edges in future graphs can be derived in closed form (Equation 1), where #G_u denotes the number of training graphs containing a vertex labeled L(u) and #G_(u,v) denotes the number of training graphs containing an edge between vertices labeled L(u) and L(v). See the appendix for a detailed derivation of Equation 1.\n\np((u, v) \u2208 Em | Vm, Gm, \u03b1\u03c8, \u03b2\u03c8) = (#G_(u,v) + \u03b1\u03c8) / (#G_u + \u03b1\u03c8 + \u03b2\u03c8)   (1)\n\nBy definition, an execution graph G for an execution contains a vertex if its label is a substring of the execution\u2019s trace t. Likewise, an edge is contained if an aggregation of the vertex labels is a substring of t. It follows\u00b9 that the predictive distribution can be reformulated as in Equation 2 to predict the probability of seeing the code position \u02dcs = sn after a fragment of preceding statements \u02dcf = s1 . . . sn\u22121 using the trace representation of an execution. Thus, it is not necessary to represent execution graphs G explicitly.\n\np(\u02dcs | \u02dcf , T, \u03b1\u03c8, \u03b2\u03c8) = (#{t \u2208 T | \u02dcf \u02dcs \u2208 t} + \u03b1\u03c8) / (#{t \u2208 T | \u02dcf \u2208 t} + \u03b1\u03c8 + \u03b2\u03c8)   (2)\n\nEstimating interpolation coefficients and hyperparameters. For given hyperparameters and a fixed context length n, Equation 2 predicts the likelihood for \u02dcsi following a fragment \u02dcf = \u02dcsi\u22121 . . . \u02dcsi\u2212n+1. To avoid sparsity issues while maintaining good expressiveness, we smooth various context lengths up to N by interpolation.\n\np(\u02dcsi | \u02dcsi\u22121 . . . \u02dcsi\u2212N+1, T, \u03b1\u03c8, \u03b2\u03c8, \u03b8) = \u03a3_{n=1}^{N} p(n|\u03b8) \u00b7 p(\u02dcsi | \u02dcsi\u22121 . . . \u02dcsi\u2212n+1, T, \u03b1\u03c8, \u03b2\u03c8)   (3)\n\nWe can learn from different revisions by integrating multiple Bernoulli graph models in a generative process in which coin parameters are not shared across revisions and context lengths n. This process generates a stream of statements with defect flags. We learn the hyperparameters \u03b1\u03c8 and \u03b2\u03c8 jointly with \u03b8 using an automatically derived Gibbs sampling algorithm [7].\n\nPredicting defective code positions. Having learned point estimates for \u02c6\u03b1\u03c8, \u02c6\u03b2\u03c8, and \u02c6\u03b8 from other revisions in a leave-one-out fashion, statements \u02dcs are scored by the complementary event of being normal for any preceding fragment \u02dcf:\n\nscore(\u02dcs) = max_{\u02dcf preceding \u02dcs} ( 1 \u2212 p(\u02dcs | \u02dcf , T, \u02c6\u03b1\u03c8, \u02c6\u03b2\u03c8, \u02c6\u03b8) )\n\nThe maximum is justified because an erroneous code line may show its defective behavior only in combination with some preceding code fragments, and even a single erroneous combination is enough to lead to defective behavior of the software.\n\n\u00b9For a set A we denote its cardinality by #A rather than |A| to avoid confusion with conditioning signs.\n\n
3 Reference Methods\n\nThe Tarantula model is a popular scoring heuristic for defect localization in software engineering. We will prove a connection between Tarantula and the unigram variant of the Bernoulli graph model. Furthermore, we will discuss other reference models which we will consider in the experiments.\n\n3.1 Tarantula\n\nTarantula [1] scores the likelihood of a code position s being defective according to the proportions of failing traces F and passing traces T that execute this position (Equation 4).\n\nscoreTarantula(\u02dcs) = ( #{\u00aft \u2208 F | \u02dcs \u2208 \u00aft} / #{\u00aft \u2208 F} ) / ( #{\u00aft \u2208 F | \u02dcs \u2208 \u00aft} / #{\u00aft \u2208 F} + #{t \u2208 T | \u02dcs \u2208 t} / #{t \u2208 T} )   (4)\n\nFor the case that only one test case fails, we can show an interesting relationship between Tarantula, the unigram Bernoulli graph model, and multivariate Bernoulli models (referred to in [8]). In the unigram case, the Bernoulli graph model generates a graph in which all statements in an execution are directly linked to an empty start vertex. In this case, the Bernoulli graph model is equal to a multivariate Bernoulli model generating a set of statements for each execution.\nUsing an improper prior \u03b1\u03c8 = \u03b2\u03c8 = 0, the unigram Bernoulli graph model scores a statement by scoreGraph(\u02dcs) = 1 \u2212 #{t \u2208 T | \u02dcs \u2208 t} / #{t \u2208 T}. Letting g(s) = #{t \u2208 T | s \u2208 t} / #{t \u2208 T}, the rank order of any two code positions s1, s2 is determined by 1 \u2212 g(s1) > 1 \u2212 g(s2), or equivalently 1/(1 + g(s1)) > 1/(1 + g(s2)), which is Tarantula\u2019s ranking criterion if #F is 1.\n\n3.2 Bernoulli Fragment Model\n\nInspired by this equivalence, we study a naive n-gram extension to multivariate Bernoulli models which we call the Bernoulli fragment model. Instead of generating a set of statements, the Bernoulli model may generate a set of fragments for each execution.\nGiven a fixed order n, the Bernoulli fragment model draws a coin parameter for each possible fragment f = s1 . . . sn over the alphabet Sm. For each execution the fragment set is generated by tossing a fragment\u2019s coin and including all fragments with outcome b = 1 (cf. Figure 3b). The probability of an unseen fragment \u02dcf is given by p( \u02dcf | T, \u03b1\u03c8, \u03b2\u03c8) = (#{t \u2208 T | \u02dcf \u2208 t} + \u03b1\u03c8) / (#{t \u2208 T} + \u03b1\u03c8 + \u03b2\u03c8).\nThe model deviates from reality in that it may generate fragments that cannot be aggregated into a consistent sequence of code positions. Thus, non-zero probability mass is given to impossible events, which is a potential source of inaccuracy.\n\n3.3 Multinomial Models\n\nThe multinomial model is popular in the text domain\u2014e.g., [8]. In contrast to the Bernoulli graph model, the multinomial model takes the number of occurrences of a pattern within an execution into account. It consists of a hierarchical process in which first a procedure m is drawn from a multinomial distribution \u03b3, then a code position s is drawn from the multinomial distribution \u03c6m ranging over all code positions Sm in the procedure.\nThe n-gram model is a well-known extension of the unigram multinomial model, where the distributions \u03c6 are conditioned on the preceding fragment of code positions f = s1 . . . sn\u22121 to draw a follow-up statement sn \u223c \u03c6m,f . 
Using fixed symmetric Dirichlet distributions with parameters \u03b1\u03b3 and \u03b1\u03c6 as priors for the multinomial distributions, the probability for an unseen code position \u02dcs following fragment \u02dcf is given in Equation 5. Shorthand #T_{s\u2208m} denotes how often statements in procedure m are executed (summing over all traces t \u2208 T in the training set); and #T_{m,s1...sn} denotes the number of times statements s1 . . . sn are executed consecutively by procedure m.\n\np(\u02dcs, \u02dcm | \u02dcf , T, \u03b1\u03b3, \u03b1\u03c6) \u221d [ (#T_{s\u2208\u02dcm} + \u03b1\u03b3) / (\u03a3_{m\u2032\u2208M} #T_{s\u2208m\u2032} + \u03b1\u03b3 #M) ] \u00b7 [ (#T_{\u02dcm, \u02dcf\u02dcs} + \u03b1\u03c6) / (#T_{\u02dcm, \u02dcf} + \u03b1\u03c6 #S_{\u02dcm}) ]   (5)\n\nwhere the first factor is the estimate of \u03b3(\u02dcm) and the second factor is the estimate of \u03c6_{\u02dcm, \u02dcf}(\u02dcs).\n\n3.4 Holmes\n\nChilimbi et al. [3] propose an approach that relies on a stream of sampled boolean predicates P , each corresponding to an executed control flow branch starting at code position s. The approach evaluates whether P being true increases the probability of failure in contrast to reaching the code position by chance. Each code position is scored according to the importance of its predicate P , which is the harmonic mean of sensitivity and increase in failure probability. Shorthands Fe(P ) and Se(P ) refer to the number of failing/passing traces that executed the path P , whereas Fo(P ) and So(P ) refer to the number of failing/passing traces that executed the start point of P .\n\nImportance(P ) = 2 / ( log #F / log Fe(P ) + ( Fe(P ) / (Se(P ) + Fe(P )) \u2212 Fo(P ) / (So(P ) + Fo(P )) )^{\u22121} )\n\nThis scoring procedure is not applicable to cases where a path is executed in only one failing trace, as a division by zero occurs in the first term when Fe(P ) = 1, since log 1 = 0. 
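The importance score described above\u2014the harmonic mean of sensitivity, log Fe(P)/log #F, and increase in failure probability\u2014can be sketched as follows; the function name and scalar-count interface are illustrative assumptions:

```python
import math

def importance(Fe, Se, Fo, So, F_total):
    """Harmonic mean of sensitivity (log Fe / log #F) and increase in
    failure probability, as described for Holmes above.  Raises
    ZeroDivisionError when Fe == 1, since log(1) == 0 -- the case that
    makes the score inapplicable with a single failing trace."""
    sensitivity = math.log(Fe) / math.log(F_total)
    increase = Fe / (Se + Fe) - Fo / (So + Fo)
    return 2.0 / (1.0 / sensitivity + 1.0 / increase)
```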
This issue renders Holmes inapplicable to our case study, where typically only one test case fails.\n\n3.5 Delta LDA\n\nAndrzejewski et al. [4] use a variant of latent Dirichlet allocation (LDA) [5] to identify topics of co-occurring statements. Most topics may be used to explain passing and failing traces, while some topics are reserved to explain statements in the failing traces only. This is achieved by running LDA with different Dirichlet priors on passing and failing traces. After inference, the topic-specific statement distributions \u03c6 = p(s|z) are converted to p(z|s) via Bayes\u2019 rule. Then statements j are ranked according to the confidence Sij = p(z = i|s = j) \u2212 max_{k\u2260i} p(z = k|s = j) of being about a bug topic i rather than any other topic k.\n\n4 Experimental Evaluation\n\nIn this section we study empirically how accurately the Bernoulli graph model and the reference models discussed in Section 3 localize defects that occurred in two large-scale development projects.\nWe find that data used in previous studies is not appropriate for our investigation. The SIR repository [9] provides traces of small programs into which defects have been injected. However, as pointed out in [10], there is no strong argument as to why results obtained on specifically designed programs with artificial defects should necessarily transfer to realistic software development projects with actual defects. The Cooperative Bug Isolation project [11], on the other hand, collects execution data from real applications, but records only a random sample of 1% of the executed code positions; complete execution traces cannot be reconstructed. Therefore, we use the development history of two large-scale open source development projects, AspectJ and Rhino, as gathered in [12].\n\nData set. 
From Rhino\u2019s and AspectJ\u2019s bug databases, we select defects which are reproducible by a test case and identify the corresponding revisions in the source code repository. For such revisions, the test code contains a test case that fails in one revision but passes in the following revision. We use the code positions that were modified between the two revisions as ground truth for the defective code positions D. For AspectJ, these are one or two lines of code; the Rhino project contains larger code changes. For each such revision, traces T of passing test cases are recorded on a line number basis. In the same manner, the failing trace \u00aft (in which the defective code is to be identified) is recorded.\nThe AspectJ data set consists of 41 defective revisions and a total of 45 failing traces. Each failing trace has a length of up to 2,000,000 executed statements covering approx. 10,000 different code positions (of the 75,000 lines in the project), spread across 300 to 600 files and 1,000 to 4,000 procedures. For each revision, we recorded 100 randomly selected valid test cases (drawn out of approx. 1,000).\nRhino consists of 15 defective revisions with one failing trace per bug. Failing traces have an average length of 3,500,000 executed statements, covering approx. 2,000 of 38,000 code positions, spread across 70 files and 650 procedures. We randomly selected 100 of the 1,500 valid traces for each revision as training data. Both data sets are available at http://www.mpi-inf.mpg.de/~dietz/debugging.html.\n\nFigure 4: Recall of defective code positions within the 1% highest scored statements for AspectJ (top) and Rhino (bottom), for windows of h = 0, h = 1, and h = 10 code lines.\n\nEvaluation criterion. Following the evaluation in [1], we evaluate how well the models are able to guide the user into the vicinity of a defective code position. The models return a ranked list of code positions. 
Envisioning that the developer can navigate from the ranking into the source code to inspect a code line within its context, we evaluate the rank k at which a line of code occurs that lies within a window of \u00b1h lines of code of a defective line. We plot relative ranks; that is, absolute ranks divided by the number of covered code lines, corresponding to the fraction of code that the developer has to walk through in order to find the defect. We examine the recall@k%, that is, the fraction of successfully localized defects over the fraction of code the user has to inspect. We expect a typical developer to inspect the top 0.25% of the ranking, corresponding to approximately 25 ranks for AspectJ.\nNeither the AUC nor the Normalized Discounted Cumulative Gain (NDCG) appropriately measures performance in our application. AUC does not allow for a cut-off rank; NDCG will inappropriately reward cases in which many statements in a defect\u2019s vicinity are ranked highly.\n\nReference methods. In order to study the helpfulness of each generative model, we evaluate smoothed models with maximum length N = 5 for each of the multinomial, Bernoulli fragment, and Bernoulli graph models. We compare those to the unigram multinomial model and Tarantula. Tuning and prediction for the reference methods follow the procedure described in Section 2. In addition, we compare to the latent variable model Delta LDA with nine usage topics and one bug topic, \u03b1 = 0.5, \u03b2 = 0.1, and 50 sampling iterations.\n\nResults. The results are presented in Figure 4. The Bernoulli graph model is always ahead of the reference methods that have a closed-form solution in the top 0.25% and top 0.5% of the ranking. This improvement is significant at the 0.05 level in comparison to Tarantula for h = 1 and h = 10. It is significantly better than the n-gram multinomial model for h = 1. 
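The evaluation criterion described above\u2014the smallest relative rank at which the ranking reaches a line within \u00b1h lines of a defective line\u2014can be sketched as follows; the function name is illustrative, and plain integer line numbers stand in for the paper\u2019s (file, line) code positions:

```python
def relative_rank_of_hit(ranked_positions, defective_lines, h=1):
    """Smallest relative rank (rank / #covered positions) at which the
    ranked list reaches a line within ±h lines of a defective line, or
    None if no ranked line qualifies.  A defect counts as localized at
    recall@k% if this value is at most k/100."""
    for k, line in enumerate(ranked_positions, start=1):
        if any(abs(line - d) <= h for d in defective_lines):
            return k / len(ranked_positions)
    return None
```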
Although increasing h makes the prediction problem generally easier, only the Bernoulli graph and the multinomial n-gram model play to their strengths.\nA comparison by area under the curve in the top 0.25% and top 0.5% indicates that the Bernoulli graph is more than twice as effective as Tarantula on both data sets for h = 1 and h = 10. Using the Bernoulli graph model, a developer finds nearly every second bug in the top 1% in both data sets, where ranking a failing trace takes between 10 and 20 seconds.\nAccording to a paired t-test at the 0.05 level, the Bernoulli graph\u2019s prediction performance is significantly better than Delta LDA for the Rhino data set. No significant difference is found for the AspectJ data set, but Delta LDA takes much longer to compute (approx. one hour versus 20 seconds) since its parameters cannot be obtained in closed form but require iterative sampling.\n\nAnalysis. Most revisions in our data sets had bugs that were equally difficult for most of the models. From revisions where one model drastically outperformed the others, we identified different categories of suspicious code areas. In some cases, the defective procedures were executed in very few or no passing traces; we refer to such code as insufficiently covered. Another category refers to defective code lines in the vicinity of branching points such as if-statements. 
If code before the branch point is executed in many passing traces, but code in one of the branches only rarely, we call this a suspicious branch point.\nThe Bernoulli fragment model treats both kinds of suspicious code areas in a similar way. They have different effects on the predictive Beta posteriors in the Bernoulli graph model: insufficient coverage decreases the confidence; suspicious branch points decrease the mean. The Beta priors \u03b1\u03c8 and \u03b2\u03c8 play a crucial role in weighting these two types of potential bugs in the ranking and encode prior beliefs about expecting one or the other. Our hyperparameter estimation procedure usually selects \u03b1\u03c8 = 1.25 and \u03b2\u03c8 = 1.03 for all context lengths.\nRevisions in which Bernoulli fragment outperformed Bernoulli graph contained defects in insufficiently covered areas. Presumably, Bernoulli graph identified many suspicious branching points and assigned them a higher score. Revisions in which Bernoulli graph outperformed Bernoulli fragment contained bugs at suspicious branching points.\nIn contrast to the Bernoulli-style models, the multinomial models take the number of occurrences of a code position within a trace into account. Presumably, multiple occurrences of code lines within a trace do not indicate their defectiveness.\n\n5 Conclusions\n\nWe introduced the Bernoulli graph model, a generative model that implements a distribution over program executions. The Bernoulli graph model generates n-gram execution graphs. Compared to execution traces, execution graphs abstract from the number of iterations for which sequences of code positions have been executed. The model allows for Bayesian inference of the likelihood of transitional patterns in a new trace, given execution traces of passing test cases. 
We evaluated the model and several less complex reference methods with respect to their ability to localize defects that occurred in the development history of AspectJ and Rhino. Our evaluation does not rely on artificially injected defects.\nWe find that the Bernoulli graph model outperforms Delta LDA on Rhino and performs as well as Delta LDA on the AspectJ project, but in substantially less time. Delta LDA is based on a multinomial unigram model, which performs worst in our study. This gives rise to the conjecture that Delta LDA might benefit from replacing the multinomial model with a Bernoulli graph model. This conjecture would need to be studied empirically.\nThe Bernoulli graph model outperforms the reference models with closed-form solutions with respect to giving a high rank to code positions that lie in the close vicinity of the actual defect. In order to find every second defect in the release history of Rhino, the Bernoulli graph model walks the developer through approximately 0.5% of the code positions, and through 1% in the AspectJ project.\n\nAcknowledgements\n\nLaura Dietz is supported by a scholarship of Microsoft Research Cambridge. Andreas Zeller and Tobias Scheffer are supported by a Jazz Faculty Grant.\n\nReferences\n\n[1] James A. Jones and Mary J. Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In Proceedings of the International Conference on Automated Software Engineering, 2005.\n\n[2] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. Scalable statistical bug isolation. In Proceedings of the Conference on Programming Language Design and Implementation, 2005.\n\n[3] Trishul Chilimbi, Ben Liblit, Krishna Mehra, Aditya Nori, and Kapil Vaswani. Holmes: Effective statistical debugging via efficient path profiling. 
In Proceedings of the International Conference on Software Engineering, 2009.\n\n[4] David Andrzejewski, Anne Mulhern, Ben Liblit, and Xiaojin Zhu. Statistical debugging using latent topic models. In Proceedings of the European Conference on Machine Learning, 2007.\n\n[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n\n[6] Tom Minka and John Winn. Gates. In Advances in Neural Information Processing Systems, 2008.\n\n[7] Hal Daume III. HBC: Hierarchical Bayes Compiler. http://hal3.name/HBC, 2007.\n\n[8] Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998.\n\n[9] Hyunsook Do, Sebastian Elbaum, and Gregg Rothermel. Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Empirical Software Engineering, 10(4):405\u2013435, October 2005.\n\n[10] Lionel C. Briand. A critical analysis of empirical research in software testing. In Proceedings of the Symposium on Empirical Software Engineering and Measurement, 2007.\n\n[11] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. Public deployment of cooperative bug isolation. In Proceedings of the Workshop on Remote Analysis and Measurement of Software Systems, 2004.\n\n[12] Valentin Dallmeier and Thomas Zimmermann. Extraction of bug localization benchmarks from history. In Proceedings of the International Conference on Automated Software Engineering, 2007.\n", "award": [], "sourceid": 704, "authors": [{"given_name": "Laura", "family_name": "Dietz", "institution": null}, {"given_name": "Valentin", "family_name": "Dallmeier", "institution": null}, {"given_name": "Andreas", "family_name": "Zeller", "institution": null}, {"given_name": "Tobias", "family_name": "Scheffer", "institution": null}]}