Statistical Debugging
Introduction to statistical debugging
In layman's terms, statistical debugging is the process of gathering bug statistics from dynamic runs of a piece of code on remote clients' machines. When an error occurs, a user can send backtraces, crash dumps, etc. over the internet to a central debug server for the development team's review. This debug information helps the team narrow down where a particular bug exists in the code.
Statistical debugging motivations
Despite our best efforts, bugs will escape developer testing and analysis tools, and these bugs will eventually make their way into production. This happens for the following reasons:
- Dynamic analysis is unsound, and will inevitably not find all existing bugs.
- Static analysis is incomplete, and may report false bugs.
- Software development has limited resources, e.g. time, money, and people
Due to the reasons above, our software sometimes ships with unknown, and sometimes known, bugs.
Benefits of statistical debugging
Actual runs of code are a great resource for debugging for the following reasons:
- Crowdsource-based testing
- In this case the number of real runs is greater than the number of testing runs.
- Reality-directed debugging
- Real-world runs are the ones that matter most
Practical challenges
The following are some practical challenges that face statistical debugging:
- Complex systems
- Millions of lines of code are deployed by developers
- This code usually contains a mix of controlled and uncontrolled code
- Remote monitoring constraints
- On remote devices, we will have limited disk space, network bandwidth, and power to send bug reports
- Incomplete information
- We must limit the performance overhead that statistical debugging imposes on deployed applications
- We must respect the privacy and security of our users
The approach
Our approach to statistical debugging follows three steps:
- Guess the behaviors that are "potentially interesting"
- To do this, we'll start with a compile-time instrumentation of the program, allowing us to tag and identify any interesting behaviors
- Collect a sparse, fair subset of these behaviors
- Here, developers create a generic sampling framework
- This framework encompasses a feedback profile and outcome labels (success vs. failure) for each run
- Finally, developers analyze behavioral changes in successful vs. failing runs to find bugs
- This step is what gives this process the name statistical debugging
A model of behavior
To begin a breakdown of "potentially interesting" behaviors, we can follow these steps:
- Assume any interesting behavior is expressible as a predicate P on a program state at a particular program point
  - Observation of behavior = observing P
- Instrument the program to observe each predicate
- Which predicates should we observe?
Branches are interesting
In the lecture, an example is given for a conditional branch where two predicates are identified: `p == 0` and `p != 0`. Through instrumentation of this branch, counters are maintained that are incremented each time the corresponding predicate is observed to hold at the instrumented branch.
Return values are interesting
Similar to the technique above for conditional branches, we instrument function calls and track different predicates on the return value, incrementing a counter each time a particular predicate is observed. In our example for `fopen`, we track the following predicates on the return value `n`:
- `n < 0`
- `n > 0`
- `n == 0`
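As a rough sketch of the bookkeeping this instrumentation performs (the helper names and Python form are my own; the lecture's instrumentation happens at compile time), each instrumented site simply keeps one counter per predicate and increments it whenever that predicate is observed to hold:

```python
from collections import defaultdict

# One counter per (site, predicate); a real deployment would use
# compile-time instrumentation of the program, not a Python dictionary.
counts = defaultdict(int)

def observe_branch(site, p):
    """Record the branch predicates p == 0 and p != 0 at an instrumented branch."""
    counts[(site, "p == 0")] += int(p == 0)
    counts[(site, "p != 0")] += int(p != 0)

def observe_return(site, n):
    """Record the sign predicates on an instrumented call's return value n."""
    counts[(site, "n < 0")] += int(n < 0)
    counts[(site, "n > 0")] += int(n > 0)
    counts[(site, "n == 0")] += int(n == 0)

# Hypothetical site labels, matching the summary example below.
observe_branch("branch_17", p=0)
observe_return("call_41", n=-1)
print(dict(counts))
```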
What other behaviors are interesting?
This entirely depends on the problems you need to solve; some examples are:
- Number of times each loop runs
- Scalar relationships between variables, e.g. `i < j`, `i > 42`
- Pointer relationships, e.g. `p == q`, `p != nullptr`
Summarization and reporting
At the end of a particular run, we summarize the results for each predicate and report them. This involves combining the arrays of counts maintained for the predicates at each instrumented site, e.g.:
- branch_17: `p == 0`, `p != 0`
- call_41: `n < 0`, `n > 0`, `n == 0`

Becomes a single flat summary of predicate counts:

- `p == 0`, `p != 0`, `n < 0`, `n > 0`, `n == 0`
After the summarization and reporting of our run, we use the provided feedback to determine the outcome: did we succeed or fail? We also label each predicate with one of the following states:
- `-` - the predicate was not observed, neither true nor false
- `0` - the predicate was observed to be false at least once and never observed to be true
- `1` - the predicate was observed to be true at least once and never observed to be false
- `*` - the predicate was observed to be true at least once and false at least once

In the examples provided in the lecture videos, success or failure is determined by whether our `assert` call's predicate is `0` or `1`.
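As a small sketch (the helper name and count-based interface are assumptions, not the lecture's notation), the per-run state of a predicate can be derived from how many times it was observed true and false:

```python
def predicate_state(true_count, false_count):
    """Map one predicate's observation counts in a single run to its state label."""
    if true_count == 0 and false_count == 0:
        return "-"   # never observed
    if true_count == 0:
        return "0"   # observed, but never true
    if false_count == 0:
        return "1"   # observed, but never false
    return "*"       # observed both true and false

# e.g. a predicate observed true 3 times and never false in this run:
print(predicate_state(3, 0))  # "1"
```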
The need for sampling
As discussed, tracking all predicates within a program can be computationally expensive. From all of the interesting predicates within a program, we need to decide which observations we'll examine and which we'll ignore. With this in mind, we use the following principles for sampling:
- Randomly - sample random predicates
- Independently - sample random predicates independent of the sampling of other predicates
- Dynamically - sample random predicates, independently, determined at runtime
Why do we do all of the above? To institute fairness and acquire an accurate picture of rare events.
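A minimal sketch of these principles, assuming a simple per-observation coin flip (real frameworks amortize the coin flips to keep runtime overhead low, a refinement omitted here):

```python
import random
from collections import defaultdict

SAMPLING_RATE = 0.01  # assumed rate: record roughly 1 in 100 observations

counts = defaultdict(int)

def maybe_observe(site, predicate, holds):
    """Randomly and independently decide, at runtime, whether to record
    this single observation of `predicate` at `site`."""
    if random.random() < SAMPLING_RATE:
        counts[(site, predicate)] += int(holds)

# Each call flips its own coin, so whether one predicate is sampled is
# independent of whether any other predicate is sampled.
maybe_observe("branch_17", "p == 0", holds=True)
```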
Feedback reports with sampling
The feedback report for each run is a vector of sampled predicate states, one of (`-`, `0`, `1`, `*`) per predicate, plus a success or failure outcome label. With sampling, we can be certain of what we did observe, but we may miss some events. Given enough runs of the instrumented program, our samples begin to approach reality: common events are seen most often in the feedback, and rare events are seen at a rate proportional to how often they actually occur.
Finding causes of bugs
How likely is failure when predicate P is observed to be true?
- F(P) = number of failing runs where P is observed to be true
- S(P) = number of successful runs where P is observed to be true
- Failure(P) = F(P) / (F(P) + S(P))
- Example: F(P) = 20, S(P) = 30, so Failure(P) = 20 / 50 = 0.4 probability of failing when P is true
Tracking context and increase
In this slide we introduce context, determining the background chance of failure, regardless of P's value. The following are definitions in the context equation:
- F(P observed) = number of failing runs observing P
- S(P observed) = number of successful runs observing P
- Context(P) = F(P observed) / (F(P observed) + S(P observed))
- Example: F(P observed) = 40, S(P observed) = 80, so Context(P) = 40 / 120 = 0.333...
With a combination of failure and context, we can calculate the increase metric, a measurement that correlates a predicate with failing runs. The equation for increase is as follows:
Increase(P) = Failure(P) - Context(P)
Increase indicates how strongly a predicate P is associated with failure, i.e.:
- Increase(P) approaches 1 - high correlation with failing runs
- Increase(P) approaches -1 - high correlation with successful runs
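Putting the three formulas together with the example numbers above (the function names are my own):

```python
def failure(f_true, s_true):
    """Failure(P): fraction of runs that fail among those where P was true."""
    return f_true / (f_true + s_true)

def context(f_obs, s_obs):
    """Context(P): background failure rate among runs that observed P at all."""
    return f_obs / (f_obs + s_obs)

def increase(f_true, s_true, f_obs, s_obs):
    """Increase(P) = Failure(P) - Context(P)."""
    return failure(f_true, s_true) - context(f_obs, s_obs)

# Example numbers from the notes:
# F(P) = 20, S(P) = 30, F(P observed) = 40, S(P observed) = 80
print(failure(20, 30))           # 0.4
print(context(40, 80))           # 0.333...
print(increase(20, 30, 40, 80))  # roughly 0.067
```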
A first algorithm
The following algorithm is used to filter and identify predicates that predict a crash. It consists of two steps:
- Discard predicates having Increase(P) <= 0
  - e.g. bystander predicates, predicates correlated with success
  - The exact value is sensitive to a small number of observations
  - So use the lower bound of a 95% confidence interval instead
  - This essentially filters out predicates that failed but did not appear often
- Sort the remaining predicates by Increase(P)
  - Again, use the 95% lower bound
  - This yields likely causes with determinacy metrics
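The notes do not spell out how the confidence interval is computed; as one plausible sketch (an assumption on my part, not necessarily the lecture's exact statistic), a normal approximation for the difference of two proportions gives a lower bound like this:

```python
import math

Z_95 = 1.645  # assumed one-sided 95% critical value

def increase_lower_bound(f_true, s_true, f_obs, s_obs):
    """Lower bound of an (approximate) 95% confidence interval for Increase(P)."""
    fail = f_true / (f_true + s_true)
    ctx = f_obs / (f_obs + s_obs)
    var_fail = fail * (1 - fail) / (f_true + s_true)
    var_ctx = ctx * (1 - ctx) / (f_obs + s_obs)
    return (fail - ctx) - Z_95 * math.sqrt(var_fail + var_ctx)

# A predicate seen in only a handful of runs gets a wide interval, so its
# lower bound can drop to 0 or below and the predicate is discarded.
print(increase_lower_bound(20, 30, 40, 80))
```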
Revised algorithm
A revised algorithm used to determine predicates that result in a crash is as follows:
- Repeat the following steps until no runs are left:
- Compute increase, F(), etc. for all predicates
- Rank the predicates
- Add the top-ranked predicate P to the result list
- Remove P and discard all runs where P is true
- This simulates fixing the bug corresponding to P
- Discard reduces rank of correlated predicates
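A minimal sketch of this elimination loop, assuming each run is a dict with an outcome flag and the set of predicates observed true (the data layout and the placeholder ranking score are my own):

```python
def score(p, runs):
    # Placeholder ranking: number of failing runs where p was true, i.e. F(p).
    # The notes discuss richer scores such as Increase(p) and its combination
    # with F(p); any of them can be plugged in here.
    return sum(1 for r in runs if r["failed"] and p in r["true_preds"])

def isolate_bug_predictors(predicates, runs):
    """Repeatedly take the top-ranked predicate, then discard every run in
    which it was true, simulating a fix for the corresponding bug."""
    result = []
    remaining_runs = list(runs)
    remaining_preds = set(predicates)
    while remaining_runs and remaining_preds:
        top = max(remaining_preds, key=lambda p: score(p, remaining_runs))
        result.append(top)
        remaining_runs = [r for r in remaining_runs if top not in r["true_preds"]]
        remaining_preds.discard(top)
    return result

# Example with two hypothetical runs:
runs = [
    {"failed": True,  "true_preds": {"p == 0"}},
    {"failed": False, "true_preds": {"n > 0"}},
]
print(isolate_bug_predictors({"p == 0", "n > 0"}, runs))
```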
Ranking
A couple of strategies exist for ranking; we'll discuss them here:
- Ranking by increase - at first glance this seems useful; however, it often selects predicates with a high increase score but few failing runs. Such predicates are not useful because they are rare relative to the total number of runs.
- These are known as sub-bug predictors, covering special cases of more general bugs.
- Ranking by failure - the problem here is that there could be many failing
runs but low increase scores.
- These are known as super-bug predictors, covering several different bugs together.
A helpful analogy
To better select bug predictors, we want to use the following concepts:
- Precision - fraction of retrieved instances that are relevant
- Recall - fraction of relevant instances that are retrieved
- Retrieved instances - predicates reported as bug predictors
- Relevant instances - predicates that are actual bug predictors
With regard to the described concepts above, we need to achieve both high precision and high recall.
Combining precision and recall
The increase metric has high precision and low recall, whereas F() has high recall and low precision. A standard solution to combine these is to take the harmonic mean of both metrics:
2 / (1/increase(P) + 1/F(P))
This harmonic mean rewards high scores in both dimensions.
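A small sketch of this combination; the two metrics live on different scales, which the notes do not address, so the example simply plugs the raw values into the formula:

```python
def importance(increase_p, f_p):
    """Harmonic mean of Increase(P) and F(P); high only when both are high.
    Non-positive inputs are mapped to 0, since such predicates are poor
    bug predictors anyway."""
    if increase_p <= 0 or f_p <= 0:
        return 0.0
    return 2.0 / (1.0 / increase_p + 1.0 / f_p)

print(importance(0.8, 2))    # high increase, few failing runs (sub-bug-like)
print(importance(0.1, 50))   # many failing runs, low increase (super-bug-like)
print(importance(0.8, 50))   # strong in both dimensions scores highest
```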