{"title": "Statistical Debugging of Sampled Programs", "book": "Advances in Neural Information Processing Systems", "page_first": 603, "page_last": 610, "abstract": "", "full_text": "Statistical Debugging of Sampled Programs\n\nAlice X. Zheng\n\nEE Division\nUC Berkeley\n\nMichael I. Jordan\n\nCS Division and Department of Statistics\n\nUC Berkeley\n\nalicez@cs.berkeley.edu\n\njordan@cs.berkeley.edu\n\nBen Liblit\nCS Division\nUC Berkeley\n\nAlex Aiken\nCS Division\nUC Berkeley\n\nliblit@cs.berkeley.edu\n\naiken@cs.berkeley.edu\n\nAbstract\n\nWe present a novel strategy for automatically debugging programs given\nsampled data from thousands of actual user runs. Our goal is to pinpoint\nthose features that are most correlated with crashes. This is accomplished\nby maximizing an appropriately de\ufb01ned utility function. It has analogies\nwith intuitive debugging heuristics, and, as we demonstrate, is able to\ndeal with various types of bugs that occur in real programs.\n\n1 Introduction\n\nNo software is perfect, and debugging is a resource-consuming process. Most users take\nsoftware bugs for granted, and willingly run buggy programs every day with little com-\nplaint. In some sense, these user runs of the program are the ideal test suite any software\nengineer could hope for. In an effort to harness the information contained in these \ufb01eld\ntests, companies like Netscape/Mozilla and Microsoft have developed automated, opt-in\nfeedback systems. User crash reports are used to direct debugging efforts toward those\nbugs which seem to affect the most people.\nHowever, we can do much more with the information users may provide. Even if we collect\njust a little bit of information from every user run, successful or not, we may end up with\nenough information to automatically pinpoint the locations of bugs. 
In earlier work [1] we present a program sampling framework that collects data from users at minimal cost; the aggregated runs are then analyzed to isolate the bugs. Specifically, we learn a classifier on the data set, regularizing the parameters so that only the few features that are highly predictive of the outcome have large non-zero weights.

One limitation of this earlier approach is that it uses different methods to deal with different types of bugs. In this paper, we describe how to design a single classification utility function that integrates the various debugging heuristics. In particular, determinism of some features is a significant issue in this domain, and an additional penalty term for false positives is included to deal with this aspect. Furthermore, utility levels, while subjective, are robust: we offer simple guidelines for their selection, and demonstrate that results remain stable and strong across a wide range of reasonable parameter settings.

We start by briefly describing the program sampling framework in Section 2, and present the feature selection framework in Section 3. The test programs and our data set are described in Section 4, followed by experimental results in Section 5.

2 Program Sampling Framework

Our approach relies on being able to collect information about program behavior at runtime. To avoid paying large costs in time or space, we sparsely sample the program's runtime behavior. We scatter a large number of checks in the program code, but do not execute all of them during any single run. The sampled results are aggregated into counts which no longer contain chronology information but are much more space efficient.

To catch certain types of bugs, one asks certain types of questions. For instance, function call return values are good sanity checks which many programmers neglect.
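As an illustration (our own sketch, not the framework's actual implementation), a return-value check of this kind reduces to three counters per call site, recording how often the returned value was negative, zero, or positive; the site label used below is only an example:

```python
from collections import defaultdict

# Sketch of a return-value predicate counter: for each instrumented
# call site we keep three counts of how often the return value was
# negative, zero, or positive.
counts = defaultdict(lambda: [0, 0, 0])  # site -> [neg, zero, pos]

def record_return(site, value):
    """Bump the bucket that this return value falls into."""
    if value < 0:
        counts[site][0] += 1
    elif value == 0:
        counts[site][1] += 1
    else:
        counts[site][2] += 1

# Example: a site whose call returned -1 twice and 0 once.
record_return("traverse.c:122", -1)
record_return("traverse.c:122", -1)
record_return("traverse.c:122", 0)
```

The per-run feature vector fed to the later analysis is simply the concatenation of these buckets over all call sites.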
Memory corruption is another common class of bugs, for which we may check whether pointers are within their prescribed ranges. We add a large set of commonly useful assertions into the code, most of which are wild guesses which may or may not capture interesting behavior. At runtime, the program tosses a coin (with low heads probability) independently for each assertion it encounters, and decides whether or not to execute the assertion.

However, while it is not expensive to generate a random coin toss, doing so separately for each assertion would incur a very large overhead; the program would run even slower than if it just executed every assertion. The key is to combine coin tosses. Given i.i.d. Bernoulli random variables with success probability h, the number of trials it takes until the first success is a geometric random variable with probability P(n) = (1 − h)^(n−1) h. Instead of tossing a Bernoulli coin n times, we can generate a geometric random variable to be used as a countdown to the next sample. Each assertion decrements this countdown by 1; when it reaches 0, we perform the assertion and generate another geometric random variable.1

However, checking to see if the counter has reached 0 at every assertion is still an expensive procedure. For further code optimization, we analyze each contiguous acyclic code region (loop- and recursion-free) at compile time and count the maximum number of assertions on any path through that region. Whenever possible, the generated code decrements in bulk, and takes a fast path that skips over the individual checks within a contiguous code region using just a single check against this maximum threshold.

Samples are taken in chronological order as the program runs. Useful as it might be, it would take a huge amount of space to record this information. To save space, we instead record only the counts of how often each assertion is found to be true or false.
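The geometric countdown described above can be sketched as follows (a simplified model of the sampling decision, not the generated instrumentation; `maybe_check` and the density value are our own placeholders):

```python
import math
import random

SAMPLING_DENSITY = 0.01  # heads probability h of the per-assertion coin

def next_countdown(h=SAMPLING_DENSITY):
    # Draw Geometric(h) on {1, 2, ...} by inversion: the number of
    # Bernoulli(h) tosses up to and including the first head.
    u = random.random()
    return int(math.log(1.0 - u) / math.log(1.0 - h)) + 1

countdown = next_countdown()

def maybe_check(assertion):
    """Decrement the shared countdown; run the assertion only at zero."""
    global countdown
    countdown -= 1
    if countdown == 0:
        countdown = next_countdown()
        return assertion()  # evaluate the sampled predicate
    return None             # skipped: no per-site coin toss needed
```

Marginally, each assertion encounter is still sampled with probability h, but the common case is a single decrement rather than a random-number draw.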
When the program finishes, these counts, along with the program exit status, are sent back to the central server for further analysis.

The program sampling framework is a non-trivial software analysis effort. Interested readers may refer to [1] for a more thorough treatment of all the subtleties, along with detailed analyses of performance impact at different sampling rates.

1 The sampling density h controls the tradeoff between runtime overhead and data sparsity. It is set to be small enough to have tolerable overhead, which then requires more runs in order to alleviate the effects of sparsity. This is not a problem for large programs like Mozilla and Windows with thousands of crash reports a day.

3 Classification and Feature Selection

In the hopes of catching a wide range of bugs, we add a large number of rather wild guesses into the code. Having cast a much bigger net than what we may need, the next step is to identify the relevant features. Let crashes be labeled with an output of 1, and successes labeled with 0. Knowing the final program exit status (crashed or successful) leaves us in a classification setting. However, our primary goal is that of feature selection [2]. Good feature selection should be corroborated by classification performance, though in our case, we only care about features that correctly predict one of the two classes. Hence, instead of working in the usual maximum likelihood setting for classification and regularization, we define and maximize a more appropriate utility function. Ultimately, we will see that the two are not wholly unrelated.

It has been noted that the goals of variable and feature selection do not always coincide with that of classification [3]. Classification is but the means to an end.
As we demonstrate in Section 5, good classification performance assures the user that the system is working correctly, but one still has to examine the selected features to see that they make sense.

3.1 Some characteristics of the problem

We concentrate on isolating the bugs that are caused by the occurrence of a small set of features, i.e. assertions that are always true when a crash occurs.2 We want to identify the predicate counts that are positively correlated with the program crashing. In contrast, we do not care much about the features that are highly correlated with successes. This makes our feature selection an inherently one-sided process.

Due to sampling effects, it is quite possible that a feature responsible for the ultimate crash may not have been observed in a given run. This is especially true in the case of “quick and painless” deaths, where a program crashes very soon after the actual bug occurs. Normally this would be an easy bug to find, because one wouldn't have to look very far beyond the crashing point at the top of the stack. However, this is a challenge for our approach, because there may be only a single opportunity to sample the buggy feature before the program dies. Thus many crashes may have an input feature profile that is very similar to that of a successful run.
From the classification perspective, this means that false negatives are quite likely.

At the other end of the spectrum, if we are dealing with a deterministic bug,3 false positives should have a probability of zero: if the buggy feature is observed to be true, then the program has to crash; if the program did not crash, then the bug must not have occurred. Therefore, for a deterministic bug, any false positives during the training process should incur a much larger penalty compared to any false negatives.

3.2 Designing the utility function

Let (x, y) denote a data point, where x is an input vector of non-negative integer counts, and y ∈ {0, 1} is the output label. Let f(x; θ) denote a classifier with parameter vector θ. There are four possible prediction outcomes: y = 1 and f(x; θ) = 1, y = 0 and f(x; θ) = 0, y = 1 and f(x; θ) = 0, and y = 0 and f(x; θ) = 1. The last two cases represent false negatives and false positives, respectively. In the general form of utility maximization for classification (see, e.g., [4]), we can define separate utility functions for each of the four cases, and maximize the sum of the expected utilities:

max_θ E_P(Y|x) U(Y, x; θ),   (1)

where

U(Y, x; θ) = u1(x; θ) Y I{f(x;θ)=1} + u2(x; θ) Y I{f(x;θ)=0} + u3(x; θ)(1 − Y) I{f(x;θ)=0} + u4(x; θ)(1 − Y) I{f(x;θ)=1} + v(θ),   (2)

and where IW is the indicator function for event W.

2 There are bugs that are caused by non-occurrence of certain events, such as forgotten initializations. We do not focus on this type of bug in this paper.

3 A bug is deterministic if it crashes the program every time it is observed. For example, dereferencing a null pointer would crash the program without exception. Note that this notion of determinism is data-dependent: it is always predicated on the trial runs that we have seen.
The ui(x; θ) functions specify the utility of each case. v(θ) is a regularization term, and can be interpreted as a prior over the classifier parameters θ in Bayesian terminology.

We can approximate the distribution P(Y|x) simply by its empirical distribution, P(Y = 1|x; θ) := P̂(Y = 1|x) = y. The actual distribution of input features X is determined by the software under examination, hence it is difficult to specify and highly non-Gaussian. Thus we need a discriminative classifier. Let z = θᵀx, where the x vector is now augmented by a trailing 1 to represent the intercept term.4 We use the logistic function µ(z) to model the class conditional probability:

P(Y = 1|x) := µ(z) = 1/(1 + e^{−z}).   (3)

The decision boundary is set to 1/2, so that f(x; θ) = 1 if µ(z) > 1/2, and f(x; θ) = 0 if µ(z) ≤ 1/2. The regularization term is chosen to be the ℓ1 norm of θ, which has the effect of driving most θi's to zero: v(θ) := −λ‖θ‖1 = −λ Σi |θi|. To slightly simplify the formula, we choose the same functional form for u1 and u2, but add an extra penalty term for false positives:

u1(x; θ) := u2(x; θ) := δ1(log2 µ(x; θ) + 1)   (4)
u3(x; θ) := δ2(log2(1 − µ(x; θ)) + 1)   (5)
u4(x; θ) := δ2(log2(1 − µ(x; θ)) + 1) − δ3 θᵀx.   (6)

Note that the additive constants do not affect the outcome of the optimization; they merely ensure that utility at the decision boundary is zero. Also, we can fold any multiplicative constants of the utility functions into the δi, so the base of the log function is freely exchangeable. We find that the expected utility function is equivalent to:

E U = δ1 y log µ + δ2 (1 − y) log(1 − µ) − δ3 θᵀx (1 − y) I{µ>1/2} − λ‖θ‖1.   (7)

When δ1 = δ2 = 1 and δ3 = 0, Eqn. (7) is akin to the Lasso [5] (standard logistic regression with ML parameter estimation and ℓ1-norm regularization). In general, this expected utility function weighs each class separately using the δi, and has an additional penalty term for false positives.

Parameter learning is done using stochastic (sub)gradient ascent on the objective function. Besides having desirable properties like fast convergence rate and space efficiency, such on-line methods also improve user privacy. Once the sufficient statistics are collected, the trial run can be discarded, thus obviating the need to permanently store any user's private data on a central server.

Eqn. (7) is concave in θ, but the ℓ1 norm and the indicator function are non-differentiable at θi = 0 and θᵀx = 0, respectively. This can be handled by subgradient ascent methods.5 In practice, we jitter the solution away from the point of non-differentiability by taking a very small step along any subgradient. This means that none of the θi's will ever be exactly zero. But this does not matter, since weights close enough to zero are essentially taken as zero. Only the few features with the most positive weights are selected at the end.

3.3 Interpretation of the utility functions

Let us closely examine the utility functions defined in Eqns. (4)–(6). For the case of Y = 1, Fig. 1(a) plots the function log2 µ(z) + 1.
It is positive when z is positive, and approaches 1 as z approaches +∞. It is a crude but smooth approximation of the indicator function for a true positive, y I{µ>1/2}. On the other hand, when z is negative, the utility function is negative, acting as a penalty for false negatives. Similarly, Fig. 1(b) plots the utility functions for Y = 0. In both cases, the utility function has an upper bound of 1, so that the effect of correct classifications is limited. On the other hand, incorrect classifications are undesirable, thus their penalty is an unbounded (but slowly decreasing) negative number.

4 Assuming that the more abnormalities there are, the more likely it is for the program to crash, it is reasonable to use a classifier based on a linear combination of features.

5 Subgradients are a generalization of gradients that are also defined at non-differentiable points. A subgradient for a convex function is any sublinear function pivoted at that point, and minorizing the entire convex function. For convex (concave) optimization, any subgradient is a feasible descent (ascent) direction. For more details, see, e.g., [6].

Figure 1: (a) Plot of the true positive indicator function and the utility function log2 µ(z) + 1. (b) Plot of the true negative indicator function, the utility function log2(1 − µ(z)) + 1, and its asymptotic slopes −z/ log 2 and −z/2 log 2.

Taking the derivative d/dz [log2(1 − µ(z)) + 1] = −µ(z)/ log 2, we see that, when z is positive, −1 ≤ −µ(z) ≤ −1/2, so log2(1 − µ(z)) + 1 is sandwiched between the two linear functions −z/ log 2 and −z/2 log 2. It starts off being closer to −z/2 log 2, but approaches −z/ log 2 asymptotically (see Fig. 1(b)). Hence, when the false positive is close to the decision boundary, the additional penalty of θᵀx = z in Eqn.
(6) is larger than the default false positive penalty, though the two are asymptotically equivalent.

Let us turn to the roles of the multiplicative weights. δ1 and δ2 weigh the relative importance of the two classes, and can be used to deal with imbalanced training sets where one class is disproportionately larger than the other [7]. Most of the time a program exits successfully without crashing, so we have to deal with having many more successful runs than crashed runs (see Section 5). Furthermore, since we really only care about predicting class 1, increasing δ1 beyond an equal balance of the two data sets could be beneficial for feature selection performance. Finally, δ3 is the knob of determinism: if the bug is deterministic, then setting δ3 to a large value will severely penalize false positives; if the bug is not deterministic, then a small value for δ3 affords the necessary slack to accommodate runs which should have failed but did not. As we shall see in Section 5, if the bug is truly deterministic, then the quality of the final features selected will be higher for large δ3 values.

In a previous paper [1], we outlined some simple feature elimination heuristics that can be used in the case of a deterministic bug. ⟨Elimination by universal falsehood⟩ discards any counter that is always zero, because it likely represents an assertion that can never be true. This is a very common data preprocessing step. ⟨Elimination by lack of failing example⟩ discards any counter that is zero on all crashes, because what never happens cannot have caused the crash. ⟨Elimination by successful counterexample⟩ discards any counter that is non-zero on any successful run, because these are assertions that can be true without a subsequent program failure. In our model, if a feature xi is never positive for any crashes, then its associated weight θi will only decrease in the maximization process. Thus it will not be selected as a crash-predictive feature. This handles ⟨elimination by lack of failing example⟩. Also, if a heavily weighted feature xi is positive on a successful run in the training set, then the classifier is more likely to produce a false positive. The false positive penalty term will then decrease the weight θi, so that such a feature is unlikely to be chosen at the end. Thus utility maximization also handles ⟨elimination by successful counterexample⟩. The model we derive here, then, neatly subsumes the ad hoc elimination heuristics used in our earlier work.

4 Two Case Studies

As examples, we present two case studies of C programs with bugs that are at opposite ends of the determinism spectrum. Our deterministic example is ccrypt, a small encryption utility. ccrypt-1.2 is known to contain a bug that involves overwriting existing files. If the user responds to a confirmation prompt with EOF rather than yes or no, ccrypt consistently crashes. Our non-deterministic example is GNU bc-1.06, the Unix command line calculator tool. We find that feeding bc nine megabytes of random input causes it to crash roughly one time in four while calling malloc(), a strong indication of heap corruption. Such bugs are inherently difficult to fix because they are non-deterministic: there is no guarantee that a mangled heap will cause a crash soon, or indeed at all.

ccrypt's sensitivity to EOF inputs suggests that the problem has something to do with its interactions with standard file operations. Thus, randomly sampling function return values may identify key operations close to the bug.
Our instrumented program adds instrumentation after each function call to sample and record the number of times the return value is negative, zero, or positive. There are 570 call sites of interest, for 570 × 3 = 1710 counters. In lieu of a large user community, we generate many runs artificially using reasonable inputs. Each run uses a randomly selected set of present or absent files, randomized command line flags, and randomized responses to ccrypt prompts, including the occasional EOF. We have collected 7204 trial runs at a sampling rate of 1/100, 1162 of which result in a crash. 6516 (≈ 90%) of these trial runs are randomly selected for training, and the remaining 688 are held aside for cross-validation. Out of the 1710 counter features, 1542 are constant across all runs, leaving 168 counters to be considered in the training process.

In the case of bc, we are interested in the behavior of all pointers and buffers. All pointers and array indices are scalars, hence we compare all pairs of scalar values. At any direct assignment to a scalar variable a, we identify all other variables b1, b2, ..., bn of the same type that are also in scope. We record the number of times that a is found to be greater than, equal to, or less than each bi. Additionally, we compare each pointer to the NULL value. There are 30150 counters in all, of which 2908 are not constant across all runs. Our bc data set consists of 3051 runs with distinct random inputs at a sampling rate of 1/1000. 2729 of these runs are randomly chosen as the training set, and 322 for the hold-out set.

5 Experimental Results

We maximize the utility function in Eqn. (7) using stochastic subgradient ascent with a learning rate of 10^−5. In order to make the magnitudes of the weights θi comparable to each other, the feature values are shifted and scaled to lie in [0, 1], then normalized to have unit variance.
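A single ascent step on Eqn. (7) can be sketched as follows (our own Python illustration, not the authors' MATLAB implementation; the δ values, λ, learning rate, and toy data below are placeholders, and the intercept augmentation of x is omitted for brevity):

```python
import numpy as np

def utility_step(theta, x, y, delta1=1.0, delta2=1.0, delta3=0.0,
                 lam=0.1, lr=1e-5):
    """One stochastic subgradient ascent step on Eqn. (7)."""
    z = theta @ x
    mu = 1.0 / (1.0 + np.exp(-z))           # logistic model, Eqn. (3)
    # Gradient of delta1*y*log(mu) + delta2*(1-y)*log(1-mu):
    g = (delta1 * y * (1.0 - mu) - delta2 * (1.0 - y) * mu) * x
    # Extra false-positive term -delta3*theta'x, active when y=0, mu>1/2:
    if y == 0 and z > 0:
        g -= delta3 * x
    # Subgradient of the -lam*||theta||_1 regularizer:
    g -= lam * np.sign(theta)
    return theta + lr * g                   # ascent step

# One pass over a toy data set of (x, y) pairs.
rng = np.random.default_rng(0)
theta = rng.normal(scale=1e-3, size=4)      # jittered away from zero
X = rng.integers(0, 3, size=(50, 4)).astype(float)
Y = (X[:, 0] > 1).astype(float)             # feature 0 "predicts" crashes
for x, y in zip(X, Y):
    theta = utility_step(theta, x, y, delta1=2.0, lam=0.01, lr=0.1)
```

At the non-differentiable points the code simply uses `np.sign(theta)` and the strict `z > 0` test, i.e. it picks one convenient subgradient, matching the jittering described above.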
There are four learning parameters, δ1, δ2, δ3, and λ. Since only their relative scale is important, the regularization parameter λ can be set to some fixed value (we use 0.1). For each setting of the δi, the model is set to run for 60 iterations through the training set, though the process usually converges much sooner. For bc, this takes roughly 110 seconds in MATLAB on a 1.8 GHz Pentium 4 CPU with 1 GB of RAM. The smaller ccrypt dataset requires just under 8 seconds.

The values of δ1, δ2, and δ3 can all be set through cross-validation. However, this may take a long time, plus we would like to leave the ultimate control of the values to the users of this tool. The more important knobs are δ1 and δ3: the former controls the relative importance of classification performance on crashed runs, the latter adjusts the believed level of determinism of the bug. Here are some guidelines for setting δ1 and δ3 that we find to work well in practice. (1) In order to counter the effects of imbalanced datasets, the ratio δ1/δ2 should be at least around the range of the ratio of successful to crashed runs. This is especially crucial for the ccrypt data set, which contains roughly 32 successful runs for every crash. (2) δ3 should not be higher than δ1, because it is ultimately more important to correctly classify crashes than to not have any false positives.

Figure 2: (a,b) Cross-validation scores for the ccrypt data set; (c,d) cross-validation scores for the bc data set. All scores shown are the maximum over free parameters.

As a performance metric, we look at the hold-out set confusion matrix and define the score as the sum of the percentages of correctly classified data points for each class. Fig. 2(a) shows a plot of the cross-validation score (maximum over a number of settings for δ2 and δ3) for the ccrypt data set at various δ1 values.
It is apparent from the plot that any δ1 values in the range [10, 50] are roughly equivalent in terms of classification performance. Specifically, for the case of δ1 = 30 (which is around the range suggested by Guideline 1 above), Fig. 2(b) shows the cross-validation scores plotted against different values of δ3. In this case, as long as δ3 is in the rough range of [3, 15], the classification performance remains the same.6

Furthermore, settings for δ1 and δ3 that are safe for classification also select high quality features for debugging. The “smoking gun” which directly indicates the ccrypt bug is:

traverse.c:122: xreadline() return value == 0

This call to xreadline() returns 0 if the input terminal is at EOF. In all of the above mentioned safe settings for δ1 and δ3, this feature is returned as the top feature. The rest of the higher ranked features are sufficient, but not necessary, conditions for a crash. The only difference is that, in more optimal settings, the separation between the top feature and the rest can be as large as an order of magnitude; in non-optimal settings (classification score-wise), the separation is smaller.

For bc, the classification results are even less sensitive to the particular settings of δ1, δ2, and δ3 (see Fig. 2(c,d)). The classification score is roughly constant for δ1 ∈ [5, 20], and for a particular value of δ1, such as δ1 = 5, the value of δ3 has little impact on classification performance. This is to be expected: the bug in bc is non-deterministic, and therefore false positives do indeed exist in the training set.
Hence any small value for δ3 will do.

As for the feature selection results for bc, for all reasonable parameter settings (and even those that do not have the best classification performance), the top features are a group of correlated counters that all point to the index of an array being abnormally big. Below are the top five features for δ1 = 10, δ2 = 2, δ3 = 1:

1. storage.c:176: more_arrays(): indx > optopt
2. storage.c:176: more_arrays(): indx > opterr
3. storage.c:176: more_arrays(): indx > use_math
4. storage.c:176: more_arrays(): indx > quiet
5. storage.c:176: more_arrays(): indx > f_count

6 In Fig. 2(b), the classification performance for δ1 = 30 and δ3 = 0 is deceptively high. In this case, the best δ2 value is 5, which offsets the cross-validation score by increasing the number of predicted non-crashes, at the expense of worse crash-prediction performance. The top feature becomes a necessary but not sufficient condition for a crash, a false positive-inducing feature! Hence the lesson is that if the bug is believed to be deterministic, then δ3 should always be positive.

These features immediately point to line 176 of the file storage.c. They also indicate that the variable indx seems to be abnormally big. Indeed, indx is the array index that runs over the actual array length, which is contained in the integer variable a_count. The program may crash long after the first array bound violation, which means that there are many opportunities for the sampling framework to observe the abnormally big value of indx.
Since there are many comparisons between indx and other integer variables, there is a large set of inter-correlated counters, any subset of which may be picked by our algorithm as the top features. In the training run shown above, the smoking gun of indx > a_count is ranked number 8. But in general its rank could be much smaller, because the top features already suffice for predicting crashes and pointing us to the right line in the code.

6 Conclusions and Future Work

Our goal is a system that automatically pinpoints the location of bugs in widely deployed software. We tackle different types of bugs using a custom-designed utility function with a “determinism level” knob. Our methods are shown to work on two real-world programs, and are able to locate the bugs in a range of parameter settings.

In the real world, programs contain not just one, but many bugs, which will not be distinctly labeled in the set of crashed runs. It is difficult to tease out the different failure modes through clustering: it relies on macro-level usage patterns, as opposed to the microscopic differences between failures. In on-going research, we are extending our approach to deal with the problem of multiple bugs in larger programs. We are also working on modifying the program sampling framework to allow denser sampling in more important regions of the code. This should alleviate the sparsity of features while reducing the number of runs required to yield useful results.

Acknowledgments

This work was supported in part by ONR MURI Grant N00014-00-1-0637; NASA Grant No. NAG2-1210; NSF Grant Nos. EIA-9802069, CCR-0085949, ACI-9619020, and IIS-9988642; and DOE Prime Contract No. W-7405-ENG-48 through Memorandum Agreement No. B504962 with LLNL.

References

[1] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In ACM SIGPLAN PLDI 2003, 2003.

[2] A. Blum and P. Langley.
Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997.

[3] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003.

[4] E. L. Lehmann. Testing Statistical Hypotheses. John Wiley & Sons, 2nd edition, 1986.

[5] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer–Verlag, 2001.

[6] J.-B. Hiriart-Urruty and C. Lemarechal. Convex Analysis and Minimization Algorithms, volume II. Springer–Verlag, 1993.

[7] N. Japkowicz and S. Stephen. The class imbalance problem: a systematic study. Intelligent Data Analysis Journal, 6(5), November 2002.
", "award": [], "sourceid": 2371, "authors": [{"given_name": "Alice", "family_name": "Zheng", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Ben", "family_name": "Liblit", "institution": null}, {"given_name": "Alex", "family_name": "Aiken", "institution": null}]}