Introduction

This notebook contains my notes for CS 6340: Software Analysis, provided by the Georgia Institute of Technology. All projects, homework, and quiz materials are kept private to abide by the Georgia Institute of Technology Academic Honor Code.

If you have any questions about this notebook, my contact information can be found on my website.

Introduction to Software Analysis

What is program analysis?

Program Analysis is a body of work to automatically discover useful facts about programs and can be classified into three kinds of analysis:

  • Dynamic (run-time)
  • Static (compile-time)
  • Hybrid (combination of dynamic and static)

Dynamic analysis

Dynamic program analysis infers facts about a program by monitoring its runs (executions). Some examples of well-known dynamic analysis tools are:

  • Array bounds checking - Purify
  • Data race detection - Eraser
  • Memory leak detection - Valgrind
  • Finding likely invariants - Daikon
    • An invariant is a program fact that is true for every run of the program

Static analysis

Static program analysis infers facts about a program by inspecting its code. This can be done on both source code and compiled binaries. Some examples of well-known static analysis tools are:

  • Suspicious error patterns - Lint, FindBugs, Coverity
  • Memory leak detection - Facebook Infer
  • Checking API usage rules - Microsoft SLAM
  • Verifying invariants - ESC/Java

Discovering invariants using dynamic analysis

Dynamic analysis is useful for detecting likely invariants, but it has difficulty with programs that have loops or recursion, which can lead to arbitrarily many paths. Dynamic analysis tools like Daikon, while unable to execute all of a program's paths, can rule out entire classes of invariants by observing runs of the program.

Discovering invariants using static analysis

Static analysis, by inspecting the source code of a program, enables us to definitively identify invariants.

Terminology

  • Control flow graph - a graphical representation of each statement within a program, with nodes being instructions and edges being possible execution paths between instructions.
  • Abstract state - an abstract state represents variables with a symbolic value
  • Concrete state - a concrete state assigns values to variables
  • Termination - the analysis always finishes and produces an answer in a finite amount of time
  • Completeness - a complete analysis accepts every correct program; an incomplete analysis may reject correct programs (i.e., report spurious errors)
  • Soundness - a sound analysis rejects every buggy program; an unsound analysis may accept buggy programs (i.e., miss errors)

Iterative approximation

Iterative approximation is a static analysis technique that repeatedly traverses the paths of a program, updating the facts inferred at each program point until they stop changing, at which point it has determined what values each variable can take at each point of interest.

Dynamic vs static analysis

The following statements compare the cost and effectiveness of dynamic and static program analysis:

  • Dynamic
    • Cost - Proportional to the program's execution time
    • Effectiveness - Unsound (may miss errors)
  • Static
    • Cost - Proportional to program's size
    • Effectiveness - Incomplete (may report spurious errors)

Advantage of static analysis over dynamic analysis:

Static analysis may achieve soundness - it is possible to design an analysis that does not miss any error in a program, even if some of the errors reported are false positives. Because static analysis does not require the program to be run, the cost of analyzing a piece of software is proportional to its code size, not its runtime, and the analysis can be performed on a machine that is not even able to run the code.

Advantage of dynamic analysis over static analysis:

Dynamic analysis may achieve completeness - it is possible to design an analysis that will only report errors that can be triggered in a concrete execution of the program, even if some errors may be undiscovered due to limitations on the number of runs inspected. Since dynamic analysis is performed by observing a concrete run of a program, bugs found using this method are typically easy to report / reproduce.

Undecidability of program properties

Questions like "Is a program point reachable on some input?" are undecidable meaning, we would need an infinite amount of time to conduct the program analysis to discover the answer to the proposed question. Program analysis can be sound and complete, but only if we never terminate. Thus, designing a program analysis is an art in which we balance how thorough the analysis is - the tradeoffs are dictated by the consumer of the analysis.

Consumers of program analysis

  • Compilers - compilers utilize program analysis to optimize code, for example by removing or hoisting redundant computations whose results never change (invariants).
  • Software quality tools - tools for testing, debugging, and verification. Software quality tools are used for:
    • Finding programming errors
    • Proving program invariants
    • Generating test cases
    • Localizing causes of errors
  • Integrated development environments (IDEs) - IDEs use program analysis to help programmers:
    • understand programs
    • refactor programs

Introduction to Software Testing

Software testing checks whether a program implementation agrees with a program specification. Without a specification, there is nothing to test. Testing is a form of consistency checking between implementation and specification.

Automated vs manual testing

  • Automated testing:
    • Finds bugs more quickly
    • No need to write tests
    • If software changes, no need to maintain tests
  • Manual testing:
    • Efficient test suites
    • Potentially better code coverage

Black-box vs white-box testing

  • Black-box testing:
    • Can work with code that cannot be modified
    • Does not need to analyze or study code
    • Code can be in any format
  • White-box testing:
    • Efficient test suite
    • Better code coverage

Pre- and post-conditions

  • pre-condition - a predicate assumed to be true before a function executes
  • post-condition - a predicate expected to be true after a function executes, whenever a pre-condition holds

Pre- and post-conditions are most useful if they are executable and written in the same programming language as the program under test. These conditions are a special case of assertions. Pre- and post-conditions don't have to be precise; if they are too exact, they might become more complex than the program under test.

Using pre and post-conditions

In summary: we first check whether the pre-condition holds for a test input; if it does not, the input is not a valid test case and is discarded. Otherwise, we execute the function on the input and check whether the post-condition holds for the function's output. If the post-condition is violated, the test fails; otherwise, the test succeeds and we move on to the next test case.
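As a concrete illustration (a hypothetical example, not from the course materials), here is a minimal sketch of checking a sqrt_floor function against an executable pre- and post-condition:

```python
import math

def pre(x):
    # Pre-condition: the input must be a non-negative integer
    return isinstance(x, int) and x >= 0

def post(x, result):
    # Post-condition: result is the largest integer whose square is <= x
    return result * result <= x < (result + 1) * (result + 1)

def sqrt_floor(x):
    # Function under test
    return math.isqrt(x)

def run_test(x):
    if not pre(x):
        return "invalid input (pre-condition does not hold)"
    result = sqrt_floor(x)
    return "pass" if post(x, result) else "fail"

for x in [0, 1, 15, 16, -3]:
    print(x, run_test(x))
```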

Code coverage

Code coverage is a metric to quantify the extent to which a program's code is tested by a given test suite. Code coverage is quantified as a percentage of some aspect of the program executed in the tests. Some of the most popular types of code coverage can be found below:

  • Function coverage - what percentage of functions were called?
  • Statement coverage - what percentage of statements were executed?
  • Branch coverage - what percentage of branches were taken?

Mutation analysis

Mutation analysis is founded on the "competent programmer assumption": the program is close to correct to begin with, and any existing bugs are minor. The key idea is to test mutants of the program to determine whether the test suite is good. If the test suite is good, mutants should fail the tests, because their code differs from the original in some small, incorrect way. If the test suite accepts mutants, we must continue adding test cases until the suite distinguishes the mutants from the original program.
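To make this concrete, here is a small, hypothetical sketch (not from the course materials) of a mutant and a test suite that kills it:

```python
def max_of(a, b):
    # Original program
    return a if a >= b else b

def max_of_mutant(a, b):
    # Mutant: the comparison operator has been flipped
    return a if a <= b else b

def test_suite(f):
    # Returns True only if every test passes
    return f(1, 2) == 2 and f(5, 3) == 5 and f(4, 4) == 4

print(test_suite(max_of))         # True: the original passes
print(test_suite(max_of_mutant))  # False: a good suite kills the mutant
```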

Introduction to Random Testing

Random testing (fuzzing)

Random testing, or fuzzing, is the practice of feeding random inputs to a program and observing whether it behaves correctly: either the execution satisfies the given specification, or, at a minimum, the program does not crash.

Fuzzing can be viewed as a special case of mutation analysis.
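As an illustration, here is a minimal, hypothetical fuzzer sketch that feeds random byte strings to a toy parser and records the inputs that make it crash (raise an exception):

```python
import random

def fuzz_input(max_len=100):
    # Generate a random byte string as input
    length = random.randrange(max_len)
    return bytes(random.randrange(256) for _ in range(length))

def program_under_test(data):
    # Toy parser standing in for the program under test
    text = data.decode("utf-8")      # may raise UnicodeDecodeError
    return int(text.strip() or "0")  # may raise ValueError

crashing_inputs = []
for _ in range(1000):
    data = fuzz_input()
    try:
        program_under_test(data)
    except Exception as e:
        crashing_inputs.append((data, type(e).__name__))

print(f"{len(crashing_inputs)} of 1000 random inputs caused a crash")
```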

Cuzz: fuzzing thread schedules

Cuzz, Microsoft's multi-threaded application fuzzer, introduces sleep() calls automatically in order to suss out concurrency-related bugs in the program under test. The sleep() calls are introduced systematically before each statement, which makes concurrency testing far less tedious and error-prone than manual analysis.

This method of fuzzing multi-threaded programs provides a worst-case probabilistic guarantee on finding bugs.

Depth of concurrency bugs

  • Bug depth - the number of ordering constraints a thread schedule has to satisfy to find the bug.
  • Ordering constraints - the order in which statements have to execute to introduce a bug.

From real-world studies using Cuzz, testers observed that many concurrency-related bugs have a shallow bug depth.

Probabilistic guarantee

Cuzz provides a probabilistic guarantee that a bug will be discovered in the program, given by the following formula:

  • n = number of threads
  • k = number of execution steps
  • d = bug depth
  • Probability of finding the bug in a single run ≥ 1 / (n * k^(d-1))
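For example (a worked instance of the formula above, not a number from the lecture): with n = 2 threads, k = 100 execution steps, and a bug of depth d = 2, the worst-case probability of triggering the bug in a single run is at least 1 / (2 * 100^1) = 1/200 = 0.005, so on the order of 200 runs are expected to find it.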

Measured vs worst-case probability

  • The worst-case guarantee Cuzz provides is for the hardest-to-find bug of a given depth. If multiple bugs exist, the probability of finding a bug generally increases. As the number of threads in the program increases, the measured odds of triggering a bug also increase.

Random testing: pros and cons

  • Pros:

    • Easy to implement
    • Good coverage given enough tests
    • Works against programs of any format
    • Useful for finding security vulnerabilities
  • Cons:

    • Inefficient test suites
    • Finds bugs that might be unimportant
    • Poor code coverage

Automated Test Generation

Outline

In previous chapters we covered random testing, or fuzzing. In this lesson, we'll cover automated test generation, which takes a more directed approach. Some methodologies and example tools are:

  • Systematic testing - Korat is a tool that conducts systematic testing
    • This tool targets programs with linked data structures
  • Feedback-directed random testing - Randoop is a tool that conducts this type of testing
    • This tool targets Java classes and libraries

Insights

Procedures may have an infinite number of possible test inputs; however, only a finite number of them are worth testing. The key insight for test generation is that systematically testing all inputs up to a small size can uncover a majority of the bugs.

Small test case hypothesis

The small test case hypothesis states that if there is any test that causes the program to fail, then there is a small such test. For example, if a list function works for lists of length 0 through 3, it probably works for all lists - the function is likely oblivious to the length.

Korat test generation

Remember, Korat targets applications and libraries with linked data structures. Korat enumerates the possible values of each field of each data structure, determining what each class and node can contain, and then generates test inputs covering all possible shapes of the data structure.

Korat's algorithm

The basic algorithm that Korat uses to test programs is as follows:

  • User selects maximum input size k
  • Generate all possible inputs up to size k
  • Discard inputs where the pre-condition for the function is false
  • Run program on remaining inputs
  • Check results using the function's post-condition
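Below is a minimal Python sketch of this recipe (an illustration under simplifying assumptions, using a flat value domain rather than linked structures, not Korat itself), exhaustively generating small inputs for a hypothetical sorted_insert function:

```python
from itertools import product

def pre(lst):
    # Pre-condition: input list must be sorted
    return all(lst[i] <= lst[i + 1] for i in range(len(lst) - 1))

def post(lst, value, result):
    # Post-condition: result is the sorted combination of the old elements and value
    return sorted(lst + [value]) == result

def sorted_insert(lst, value):
    # Function under test
    out = list(lst)
    i = 0
    while i < len(out) and out[i] < value:
        i += 1
    out.insert(i, value)
    return out

k = 3                  # maximum input size chosen by the user
domain = [0, 1, 2]     # small value domain
failures = 0
for size in range(k + 1):
    for candidate in product(domain, repeat=size):   # all inputs up to size k
        lst = list(candidate)
        if not pre(lst):                             # discard invalid inputs
            continue
        if not post(lst, 1, sorted_insert(lst, 1)):  # check the post-condition
            failures += 1
print("failing inputs:", failures)
```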

The general case for binary trees

To calculate the number of possible binary trees for k nodes, the formula is:

(k + 1)^(2k + 1)

An overestimate

The formula above is an overestimate of the number of possible binary trees. Some of the candidate shapes it counts are not even trees, and many of them are isomorphic - exactly the same shape as other candidates. In reality, there are only 9 distinct binary trees with at most 3 nodes.
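As a quick check of the numbers: for k = 3 the formula gives (3 + 1)^(2*3 + 1) = 4^7 = 16384 candidate shapes, even though only 9 of them are distinct, valid binary trees.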

Determining the precondition

Korat culls test inputs by instrumenting the pre-condition of the function under test. The added code observes which fields the pre-condition accesses; if a field is never accessed, the pre-condition's result does not depend on that field, so candidates differing only in that field need not be enumerated separately.

Enumerating tests

Korat enumerates shapes by their associated vectors:

  • In the initial test candidate, all fields are null
  • The next shape is generated by:
    • Expanding the last field accessed by the pre-condition
    • Backtracking if all possibilities for a field are exhausted

The key idea in this methodology is that Korat never expands parts of the test input that are not examined by the function's pre-condition. Korat also checks for and discards shapes that are isomorphic to previously generated shapes.

Strengths and weaknesses of Korat

Korat is strong when we can enumerate all possibilities, for example a small test case for four nodes, two edges per node. Korat is good for:

  • Linked data structures
  • Small, easily specified procedures
  • Unit testing

Korat is weaker for:

  • Integers, floating-point numbers, and strings

Korat is also only as good as the pre-condition of the function under test. If the pre-condition is not specific and thorough enough, Korat will generate less-than-useful test inputs.

Overview of Randoop

The problem with uniform random testing is that it creates too many illegal or redundant tests. Randoop aims to randomly create new tests guided by feedback from previously generated tests. Randoop's recipe is as follows:

  • Build new sequences incrementally, extending past sequences
  • As soon as a sequence is created, execute it
  • Use execution results to guide test generation towards sequences that create new object states

Sequence types

  • Illegal sequence - sequences that crash before the contract is checked, e.g. throw an exception
  • Redundant sequence - Randoop maintains a set of all objects created during the execution of each sequence. A sequence is redundant if the objects created during its execution are equivalent to objects already in the maintained set.
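Below is a compact, hypothetical Python sketch in the spirit of this recipe (not Randoop's actual implementation): call sequences against a toy Counter class are extended randomly, executed immediately, and discarded if they are illegal or redundant, while sequences that violate the class's contract are reported:

```python
import random

class Counter:
    """Toy class under test."""
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
    def decrement(self):
        self.value -= 1
        assert self.value >= 0, "counter went negative"  # contract

METHODS = ["increment", "decrement"]
pool = [[]]          # pool of legal, non-redundant call sequences
seen_states = set()  # object states observed so far

for _ in range(200):
    seq = random.choice(pool) + [random.choice(METHODS)]  # extend a past sequence
    obj = Counter()
    try:
        for m in seq:
            getattr(obj, m)()          # execute the sequence as soon as it is created
    except AssertionError:
        print("contract violation:", seq)  # likely bug-revealing sequence
        continue
    except Exception:
        continue                       # illegal sequence: discard
    if obj.value in seen_states:
        continue                       # redundant sequence: discard
    seen_states.add(obj.value)
    pool.append(seq)                   # feedback: keep sequences creating new states

print("kept", len(pool), "sequences")
```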

Dataflow Analysis

What is dataflow analysis?

  • A static analysis that reasons about the flow of data in a program
  • Different kinds of data can be tracked with dataflow analysis:
    • Constants
    • Variables
    • Expressions
  • Dataflow analysis is used by bug-finding tools and compilers

Soundness, completeness, and termination

It is impossible for a software analysis to achieve all three of these objectives - dataflow analysis sacrifices completeness. Dataflow analysis is sound in that it will report all facts that could occur in actual runs. Dataflow analysis is incomplete in that it may report additional facts that can't occur in actual runs.

Abstracting control-flow conditions

Dataflow analysis abstracts away control-flow conditions, treating them as non-deterministic choices - it assumes each condition can evaluate to either true or false. As a result, it considers every path that is possible in actual runs (sound), and possibly some paths that are never possible in actual runs (incomplete).

Applications of dataflow analysis

  • Reaching definitions analysis
    • This software analysis application using dataflow analysis is aimed at finding the usage of uninitialized variables in the program
  • Very busy expressions analysis
    • This software analysis application using dataflow analysis is aimed at reducing code size
  • Available expressions analysis
    • This software analysis application using dataflow analysis is aimed at avoiding recomputing expressions - useful for compiler design
  • Live variables analysis
    • This software analysis application using dataflow analysis is aimed at allocating registers efficiently for data storage - useful for compiler design

Reaching definitions analysis

The goal of reaching definitions analysis is to determine, for each program point, which assignments have been made and not overwritten when execution reaches that point along some path.

Does the reaching definitions analysis always terminate?

Yes, the chaotic iteration algorithm used by reaching definitions analysis always terminates. This is because:

  • The analysis's operations are monotonic, meaning the IN and OUT sets never shrink, only grow
  • The largest the IN and OUT sets can become is the set of all definitions in the program, so they cannot grow forever
  • Therefore, IN and OUT stop changing after some iteration, as the sketch below illustrates
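Here is a minimal Python sketch of chaotic iteration for reaching definitions (an illustration over a small, hard-coded control flow graph with hypothetical definitions, not the lecture's exact algorithm); note that the IN and OUT sets only grow, so the loop must terminate:

```python
# CFG: 1: x=1  ->  2: y=2  ->  3: x=3  ->  4: use x,y   (plus a back edge 4 -> 2)
# gen[n] is the definition generated at node n; kill[n] is the set of other
# definitions of the same variable that node n overwrites.
preds = {1: [], 2: [1, 4], 3: [2], 4: [3]}
gen   = {1: {"d1:x"}, 2: {"d2:y"}, 3: {"d3:x"}, 4: set()}
kill  = {1: {"d3:x"}, 2: set(), 3: {"d1:x"}, 4: set()}

IN  = {n: set() for n in preds}
OUT = {n: set() for n in preds}

changed = True
while changed:                      # chaotic iteration until a fixed point
    changed = False
    for n in preds:
        new_in = set().union(*[OUT[p] for p in preds[n]]) if preds[n] else set()
        new_out = gen[n] | (new_in - kill[n])
        if new_in != IN[n] or new_out != OUT[n]:
            IN[n], OUT[n] = new_in, new_out
            changed = True

for n in sorted(preds):
    print(n, "IN:", sorted(IN[n]), "OUT:", sorted(OUT[n]))
```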

Very busy expressions analysis

The goal of very busy expressions analysis is to identify, for each program point, the expressions that are very busy at the point's exit. An expression is very busy if, no matter what path is taken from that point, the expression is used before any of the variables in it are redefined.

Available expression analysis

The goal of available expressions analysis is to identify, for each program point, which expressions must already have been computed, and not later modified, on all paths to the program point.

Live variables analysis

The goal of live variables analysis is to identify, for each program point, which variables could be live at the point's exit. A variable is live if there is a path to a use of the variable that doesn't redefine the variable.

Pointer Analysis

In the previous lesson we learned about tracking the flow of primitive data types like integers. In this lesson, we'll learn how to track and analyze more complex types of data, namely pointers, objects, and references.

This sort of analysis is called pointer analysis.

Introducing pointers

  • Field write - an operation where a value is written to an attribute of a class or structure via a pointer to that object or structure.
  • Field read - an operation where a value is read from an attribute of a class or structure via a pointer to that object or structure.

Pointer aliasing

Pointer aliasing is a situation in which the same address in memory is referred to in different ways.

Approximation to the rescue

Pointer analysis is hard because there are so many different ways to reference pointer values - the lectures provide the example of a linked list. We must sacrifice some combination of soundness, completeness, and termination in order to have a decidable pointer analysis. We sacrifice completeness, causing us to have false positives, but no false negatives.

What does this mean? In the may-alias case, a pointer may be aliased or it may not. Because the analysis must report both possibilities for a given line of code, one of the two determinations may be a false positive.

There are many sound, approximate algorithms for pointer analysis with varying levels of precision. Most of these algorithms differ in two key aspects:

  • How to abstract the heap
  • How to abstract control-flow

Abstracting the heap

A points-to graph is an abstraction of the allocations that occur within the program. It collapses arrays of allocated chunks into a single node, denoting the existence of multiple chunks within a single array with an asterisk. This abstraction is only concerned with the relationships between allocated objects, and it creates directed edges based on the allocations and assignments in the source code.

Kinds of statements

The following are the kinds of statements that we must track in order to implement the pointer analysis chaotic iteration algorithm:

  • Object allocation - v = new ...
  • Object copy statement - v = v2
  • Field read - v2 = v.f
  • Field write - v.f = v2
  • Array-based field read - v2 = v[*]
  • Array-based field write - v[*] = v2

Rule for object allocation sites

When an object is allocated, we use a weak update rule. Once an object is allocated and assigned to a variable, a new node is created on the graph and the variable points to the new node - if the variable didn't already exist, it is added to the graph.

The update is weak because, if the variable already exists and points to some object, it keeps that edge and additionally points to the newly allocated node - assignments accumulate rather than overwrite.

Rule for object copy

When an object is copied to a new variable, the new variable just copies all existing edges to nodes of the copied variable. No existing nodes are overwritten, just accumulated.

Rule for field writes

In this example, if a variable points to Node A and another variable points to Node B, when the field of the first variable is assigned to the second variable, an edge is drawn between Node A and Node B. Thus, Node A's field is assigned to be the value of the second variable, or Node B.

Rules for field reads

In this example, we assigned a variable the value of another variable's field. So if the second variable points to Node B, and Node B's field is assigned Node C, then the first variable is assigned Node C (Node B's field).
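The following Python sketch (a simplified, flow-insensitive illustration, not the lecture's algorithm) applies the rules above to maintain a points-to graph as two maps: variable -> set of abstract objects, and (object, field) -> set of abstract objects:

```python
from collections import defaultdict

points_to = defaultdict(set)   # variable -> set of abstract objects
heap      = defaultdict(set)   # (abstract object, field) -> set of abstract objects

def alloc(v, site):
    points_to[v].add(site)                 # v = new ...   (weak update: edges accumulate)

def copy(v, v2):
    points_to[v] |= points_to[v2]          # v = v2

def field_write(v, f, v2):
    for obj in points_to[v]:               # v.f = v2
        heap[(obj, f)] |= points_to[v2]

def field_read(v2, v, f):
    for obj in points_to[v]:               # v2 = v.f
        points_to[v2] |= heap[(obj, f)]

# Example: x = new A; y = new B; x.f = y; z = x.f
alloc("x", "A1")
alloc("y", "B1")
field_write("x", "f", "y")
field_read("z", "x", "f")
print(dict(points_to))   # z now points to B1, the object y points to
```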

Classifying pointer analysis algorithms

The following are dimensions upon which pointer analysis algorithms are classified:

  • Is it flow-sensitive?
  • Is it context-sensitive?
  • What heap abstraction scheme is used?
  • How are aggregate data types modeled?

Flow sensitivity

Flow sensitivity describes how an algorithm models control-flow within a procedure. There are two kinds of flow sensitivity:

  • Flow-insensitive - weak updates on data, never killing any previously generated facts.
    • Suffices for may-alias analysis
  • Flow-sensitive - strong updates on data, kills previously generated facts when encountering new data
    • Required for must-alias analysis

Context sensitivity

Context sensitivity describes how an algorithm models control-flow across procedures. There are two kinds of context sensitivity:

  • Context-insensitive - analyze each procedure only once, regardless of how many times the program encounters a procedure
  • Context-sensitive - analyze each procedure possibly multiple times, once per abstract calling context

Heap abstraction

Heap abstraction is the scheme used to partition an unbounded set of concrete objects into finitely many abstract objects - represented by oval nodes in our points-to graph representations.

Heap abstraction ensures that pointer analysis terminates. There are various sound schemes, however, the following are determinations about precision and efficiency for different schemes:

  • Too few abstract objects results in efficient but imprecise analysis
  • Too many abstract objects results in expensive but precise analysis

Scheme 1: allocation-site based

This scheme of heap abstraction is the same one we covered during this lesson. It tracks one abstract object per allocation site. Allocation sites are identified by:

  • the new keyword in Java/C++
  • the malloc() (or equivalent) call in C

Because there are finitely many allocation sites in programs, this scheme results in finitely many abstract objects.

Scheme 2: type based

The allocation-site based scheme can be costly in situations such as:

  • Analyzing large programs
  • Clients needing a quick turnaround time for the analysis
  • The fine granularity of allocation sites being unnecessary for the client's needs

In the type-based scheme, one abstract object is tracked per type in the program. There are finitely many types, and therefore finitely many abstract objects.

Scheme 3: heap-insensitive

The heap-insensitive scheme tracks a single abstract object representing the entire heap. This is obviously imprecise, but it has its uses for languages like C, where pointers into the stack allow a more useful analysis; it is unsuitable for languages such as Java that make heavy use of the heap.

Modeling aggregate data types: records

Three choices exist for the modeling of records, otherwise known as structs or objects:

  1. Field-insensitive - merge all fields of each record object
    • This makes it difficult for an analysis to track reads and writes to specific attributes of an object or structure
  2. Field-based - merge each field of all record objects
    • This makes it difficult for an analysis to distinguish between fields of different objects or structures
  3. Field-sensitive - keep each field of each record object separate

Constraint Based Analysis

Constraint-based analysis is a dominant approach to program analysis and is declarative in nature: it expresses what the analysis computes, not how it computes it - that is, the specification of the analysis rather than its implementation. This type of analysis has the following advantages:

  • Simplified design and understanding of the analysis
  • Allows for rapid prototyping
  • Continuous performance improvements as the underlying constraint solvers improve

Motivation

Designing an efficient program analysis is a challenging task that requires the analyst to define the specification and implementation of the program analysis, e.g.:

  • Specification - "what"; no null pointer is dereferenced along any path in the program
  • Implementation - "how"; many design choices exist here:
    • forward vs backward traversal of the program
    • symbolic vs explicit representation of the program state
    • etc.

What is constraint based analysis?

In constraint-based analysis, an analyst defines the constraints using a constraint language; the implementation of the program analysis is then automated by a constraint solver.

Benefits of constraint based analysis

The following are some benefits of constraint based analysis:

  • Separates analysis specification from implementation
    • Analysis writer can focus on what rather than how
  • Yields natural program specifications
    • Constraints are usually local, and their conjunction captures global properties
  • Enables sophisticated analysis implementations
    • Leverage powerful, off-the-shelf solvers for implementation

What is Datalog?

Datalog is:

  • A declarative logic programming language
  • Not Turing-complete: subset of Prolog, or SQL with recursion
    • Efficient algorithms exist to evaluate Datalog programs

Datalog:

  • Originated as a query language for deductive databases
  • Later applied in many other domains:
    • software analysis
    • data mining
    • networking
    • security
    • knowledge representation
    • cloud-computing
    • etc
  • Many Datalog engines exist, e.g.:
    • Logicblox
    • bddbddb
    • IRIS
    • Paddle
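Since no Datalog code appears in these notes, here is a small Python sketch (an illustration of how a Datalog engine evaluates rules, not a real engine) that computes the classic reachability program reach(X,Y) :- edge(X,Y) and reach(X,Z) :- reach(X,Y), edge(Y,Z) by applying the rules to a fixed point:

```python
# Input relation (EDB): edges of a small graph
edge = {(1, 2), (2, 3), (3, 4)}

# Derived relation (IDB), computed by applying the rules until nothing new is derived:
#   reach(X, Y) :- edge(X, Y).
#   reach(X, Z) :- reach(X, Y), edge(Y, Z).
reach = set(edge)
changed = True
while changed:
    new_facts = {(x, z) for (x, y) in reach for (y2, z) in edge if y == y2}
    changed = not new_facts <= reach
    reach |= new_facts

print(sorted(reach))  # all pairs (X, Y) such that Y is reachable from X
```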

Cloning based inter-procedural analysis

To enable context-sensitivity for program analysis using the methods described in this lecture, one can clone a procedure's variables once per calling context in the points-to graph generated by pointer analysis. This prevents us from losing context while conducting inter-procedural analysis.

Through this technique, we achieve context sensitivity because we are inlining procedure calls. This increases our precision, but decreases our scalability.

Type Systems

Type systems are usually specified as part of the programming language and are usually built into compilers or interpreters for the language. The main purpose of type systems is to reduce the possibility of software bugs by checking for logic errors.

A common type of error detected by type systems is applying operations to incompatible operands, e.g. adding an integer to a string in a Java program. Type systems catch these errors using a collection of rules that assign types to different parts of the program, i.e.:

  • Variables
  • Expressions
  • Functions

What is a type?

A type is a set of values. For example, in Java:

  • int is the set of all integers between -2^31 and (2^31)-1
  • double is the set of all double-precision floating point numbers
  • boolean is the set {true, false}

Some more examples related to object-oriented types and functions are as follows:

  • Foo is the set of all objects of class Foo
  • List<Integer> is the set of all Lists of Integer objects
    • List is a type constructor
    • List acts as a function from types to types
  • int -> int is the set of functions taking an int as a parameter and returning another int

Abstraction

All static analyses use abstraction, representing sets of concrete values as abstract values.

Why?

Without abstraction, we can't directly reason about infinite sets of concrete values - this would not guarantee termination. Abstraction improves performance even in the case of large, finite sets. In type systems, abstractions are called types.

Notation for inference rules

Inference rules have the following form:

  • If hypothesis is true, then conclusion is true.

Type checking computes via reasoning:

  • If e1 is an int and e2 is a double, then e1 * e2 is a double

The above statement, translated from English into a type-analysis inference rule, is as follows:

  • (e1 : int ^ e2 : double) => e1 * e2 : double
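Written as a typing judgment in the standard inference-rule notation (a conventional presentation, assuming a type environment E mapping variables to types), the rule reads:

```latex
\frac{E \vdash e_1 : \texttt{int} \qquad E \vdash e_2 : \texttt{double}}
     {E \vdash e_1 * e_2 : \texttt{double}}
```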

Soundness

Soundness is extremely useful: if a program type checks, no type errors occur at runtime, verifying the absence of a whole class of errors related to bad typing. A sound type system provides a strong guarantee that holds over all executions: "Well-typed programs cannot go wrong."

Soundness comes at a price and can lead to false positives.

Global analysis

Also called top-down analysis, global analysis requires inspecting the entire program to construct a model of the environment, which is then needed to analyze sub-expressions.

Local analysis

Also called bottom-up analysis, local analysis infers an environment from each expression it analyzes, combining these inferences at the end of the analysis.

Global vs local analysis

The following is a comparison of global vs local analysis:

  • Global analysis:
    • Usually technically simpler than local analysis
    • May need extra work to model environments for unfinished programs
  • Local analysis:
    • More flexible in application
    • Technically harder: Need to allow unknown parameters, more side conditions

Flow insensitivity

Type systems are generally flow-insensitive, meaning their analysis is independent of the ordering of sub-expressions - the results are unaffected by permuting statements.

This attribute of type systems analysis is helpful for the following reasons:

  • No need for modeling a separate state for each sub-expression
  • Flow insensitive analyses are often very efficient and scalable

These attributes make type systems efficient, but they can also make the analysis imprecise.

Flow sensitivity

Dataflow analysis in contrast is an example of flow-sensitive analysis. As rules produce new environments, the analysis of a sub-expression cannot take place until its environment is available. The analysis results in this case are dependent on the order of statements.

Path sensitivity

In a path-sensitive type-system analysis, the environment additionally contains a predicate declaring under what conditions an expression is executed. Different sub-expressions carry different predicates, established at the program's decision points.

At points where control paths merge, different paths will still be denoted in the output results / final environment.

Symbolic execution is an example of a path-sensitive analysis. Path-sensitive analyses can be expensive, as there are an exponential number of paths to track. These analyses are often implemented with backtracking, i.e. exploring one path until it ends and then starting another.

Statistical Debugging

Introduction to statistical debugging

In layman's terms, statistical debugging is the process of acquiring bug statistics from dynamic runs of a particular piece of code on remote clients' machines. If an error occurs, a user can send backtraces, crash dumps, etc. via the internet to a central debug server for the development team's review. This debug information allows teams to narrow down where a particular bug exists in the code.

Statistical debugging motivations

Despite our best efforts, bugs will escape developer testing and analysis tools, and these bugs will eventually make it into production. This is due to the following reasons:

  • Dynamic analysis is unsound, and will inevitably miss some existing bugs
  • Static analysis is incomplete, and may report false bugs
  • Software development has limited resources, e.g. time, money, and people

Due to the reasons above, our software sometimes ships with unknown, and sometimes known, bugs.

Benefits of statistical debugging

Actual runs of code are a great resource for debugging for the following reasons:

  • Crowdsource-based testing
    • In this case the number of real runs is greater than the number of testing runs.
  • Reality-directed debugging
    • Real-world runs are the ones that matter most

Practical challenges

The following are some practical challenges that face statistical debugging:

  1. Complex systems
    • Millions of lines of code are deployed by developers
    • This code usually contains a mix of controlled and uncontrolled code
  2. Remote monitoring constraints
    • On remote devices, we have limited disk space, network bandwidth, and power for sending bug reports
  3. Incomplete information
    • We must limit the performance overhead of statistical debugging on deployed applications
    • We must respect the privacy and security of our users

The approach

Our approach to statistical debugging follows three steps:

  1. Guess the behaviors that are "potentially interesting"
    • To do this, we start with a compile-time instrumentation of the program, allowing us to tag and identify interesting behaviors
  2. Collect a sparse, fair subset of these behaviors
    • Here, developers create a generic sampling framework
    • This framework produces a feedback profile and an outcome label (success vs. failure) for each run
  3. Analyze behavioral changes in successful vs. failing runs to find bugs
    • This step is what gives the process the name statistical debugging

A model of behavior

To begin a breakdown of "potentially interesting" behaviors, we can follow these steps:

  • Assume any interesting behavior is expressible as a predicate P on a program state at a particular program point:

    • Observation of behavior = observing P
  • Instrument the program to observe each predicate

  • Which predicates should we observe?

Branches are interesting

In the lecture an example is given for a conditional branch where two predicates are identified:

  • p == 0
  • p != 0

Through instrumentation of this particular branch, counters are maintained that are incremented each time one of these predicates is observed to hold at the branch.

Return values are interesting

Similar to the technique above for conditional branches, we instrument function calls and their return values, tracking a predicate for each possible sign of the return value and incrementing the corresponding counter each time a predicate holds. In the example for fopen, we track the following predicates:

  • n < 0
  • n > 0
  • n == 0

What other behaviors are interesting?

This entirely depends on the problem you need to solve; some examples are:

  • Number of times each loop runs
  • Scalar relationships between variables, e.g. i < j, i > 42
  • Pointer relationships, e.g. p == q, p != nullptr

Summarization and reporting

At the end of a particular run, we summarize the results for each predicate and report them. This involves merging the per-site arrays of predicate counts, e.g.:

  • branch_17
    • p == 0
    • p != 0
  • call_41
    • n < 0
    • n > 0
    • n == 0

Becomes:

  • p == 0
  • p != 0
  • n < 0
  • n > 0
  • n == 0

After the summarization and reporting of our run, we use the provided feedback to determine the outcome: did the run succeed or fail? We also label each predicate with one of the following states:

  • - - This denotes that a particular predicate was not observed, neither true nor false
  • 0 - This denotes that a particular predicate was observed to be false at least once and never observed to be true
  • 1 - This denotes that a particular predicate was observed to be true at least once and never observed to be false
  • * - This denotes that a particular predicate was observed at least once to be true and at least once to be false

In the examples provided in the lecture videos, success or failure is determined by whether the program's assert succeeds or fails.

The need for sampling

As is discussed, tracking all predicates within a program can be computationally expensive. From all of the interesting predicates within a program, we need to decide which predicates we'll examine and which we'll ignore. With this in mind, we use the following principles for sampling:

  • Randomly - sample random predicates
  • Independently - sample random predicates independent of the sampling of other predicates
  • Dynamically - sample random predicates, independently, determined at runtime

Why do we do all of the above? To institute fairness and acquire an accurate picture of rare events.

Feedback reports with sampling

The feedback report for each run is a vector of sampled predicate states:

  • (-, 0, 1, *)

And a success or failure outcome label. With sampling, we can be certain of what we did observe but we may miss some events. Given enough runs of the instrumented program, our samples begin to approach reality. Common events are seen most often in the feedback, and rare events are seen at a proportionate rate.

Finding causes of bugs

How likely is failure when predicate P is observed to be true?

  • F(P) = number of failing runs where P is observed to be true
  • S(P) = number of successful runs where P is observed to be true
  • Failure(P) = F(P) / (F(P) + S(P))
  • Example:
    • F(P) = 20
    • S(P) = 30
    • Failure(P) = 20 / 50 = 0.4 probability of failing when P is true

Tracking context and increase

Next we introduce context, which determines the background chance of failure regardless of P's value. The following definitions are used in the context equation:

  • F(P observed) = number of failing runs observing P
  • S(P observed) = number of successful runs observing P
  • Context(P) = F(P observed) / (F(P observed) + S(P observed))
  • Example: F(P observed) = 40, S(P observed) = 80
    • Context(P) = 40 / 120 = 0.333...

With a combination of failure and context, we can calculate the increase metric, a true measurement that can correlate a predicate to failing runs. The equation for increase is as follows:

  • Increase(P) = Failure(P) - Context(P)

Increase is a ratio that determines how strongly a predicate P correlates with failure, i.e.:

  • Increase(P) approaches 1 - high correlation with failing runs
  • Increase(P) approaches -1 - high correlation with successful runs
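Using the example numbers from above, a short Python sketch of these metrics (just the formulas as given in these notes) is:

```python
def failure(f_true, s_true):
    # F(P) failing and S(P) successful runs where P was observed to be true
    return f_true / (f_true + s_true)

def context(f_obs, s_obs):
    # Runs where P was observed at all, regardless of its value
    return f_obs / (f_obs + s_obs)

def increase(f_true, s_true, f_obs, s_obs):
    return failure(f_true, s_true) - context(f_obs, s_obs)

# Example values from the notes: F(P)=20, S(P)=30, F(P observed)=40, S(P observed)=80
print(failure(20, 30))             # 0.4
print(context(40, 80))             # 0.333...
print(increase(20, 30, 40, 80))    # about 0.067
```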

A first algorithm

The following algorithm is used to filter and rank predicates that predict a crash. This algorithm consists of two steps:

  1. Discard predicates having Increase(P) <= 0
    • e.g. bystander predicates, predicates correlated with success
    • The exact value is sensitive to having few observations
    • Use the lower bound of the 95% confidence interval
      • This essentially filters out predicates that were associated with failure but were rarely observed
  2. Sort the remaining predicates by Increase(P)
    • Again, use the 95% lower bound
    • This yields likely causes with determinacy metrics

Revised algorithm

A revised algorithm used to determine the predicates that result in a crash is as follows:

  • Repeat the following steps until no runs are left:
    1. Compute increase, F(), etc. for all predicates
    2. Rank the predicates
    3. Add the top-ranked predicate P to the result list
    4. Remove P and discard all runs where P is true
    • This simulates fixing the bug corresponding to P
    • Discard reduces rank of correlated predicates

Ranking

A couple of strategies exist for ranking; we'll discuss them here:

  • Ranking by Increase(P) - at first glance this seems useful; however, it often ranks highly predicates with a large Increase score but very few failing runs, which are unhelpful because they are so rare relative to the total number of runs.
    • These are known as sub-bug predictors, covering special cases of more general bugs.
  • Ranking by F(P) - the problem here is that a predicate can have many failing runs but a low Increase score.
    • These are known as super-bug predictors, covering several different bugs together.

A helpful analogy

To better select bug predictors, we want to use the following concepts:

  • Precision - fraction of retrieved instances that are relevant
  • Recall - fraction of relevant instances that are retrieved
  • Retrieved instances - predicates reported as bug predictors
  • Relevant instances - predicates that are actual bug predictors

With regard to the described concepts above, we need to achieve both high precision and high recall.

Combining precision and recall

The increase metric has high precision and low recall, whereas F() has high recall and low precision. A standard solution to combine these is to take the harmonic mean of both metrics:

  • 2 / (1/increase(P) + 1/F(P))

This harmonic mean rewards high scores in both dimensions.

Delta Debugging

Delta debugging is a technique used to determine the source of a particular software bug. It follows the scientific method: hypothesize, experiment, and refine. With delta debugging, we continually reduce the input that produces a bug until we reach a minimum amount of input required to trigger the bug.

Simplification

We simplify our testing input to generate bugs for the following reasons:

  • Ease of communication
  • Easier debugging
  • Identification of duplicates

Simplification techniques

The first technique discussed in this lecture is the use of a binary search to remove input until the crash-producing input is identified.

Delta debugging algorithm

Delta debugging is similar to binary search; however, when it fails to reproduce the bug, it increases the granularity of the changes, or deltas, that it tests. Deltas are blocks of the input being used to test the target program. The delta debugging algorithm keeps trying different subsets of deltas until it triggers the bug; the bug-producing input is retained, and the granularity of the deltas increases until the input can be reduced no further.
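Below is a minimal Python sketch in the spirit of the ddmin algorithm (a simplified variant, not the full algorithm from the lecture); it assumes a hypothetical triggers_bug function that returns True when the given input still triggers the bug:

```python
def delta_debug(data, triggers_bug):
    """Shrink `data` (a list) while `triggers_bug(data)` stays True."""
    granularity = 2
    while len(data) >= 2:
        chunk = max(1, len(data) // granularity)
        subsets = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Try removing one chunk (i.e., keep the complement)
            candidate = [x for j, s in enumerate(subsets) if j != i for x in s]
            if candidate and triggers_bug(candidate):
                data, granularity, reduced = candidate, max(granularity - 1, 2), True
                break
        if not reduced:
            if granularity >= len(data):
                break                                          # cannot refine further
            granularity = min(len(data), granularity * 2)      # increase granularity
    return data

# Hypothetical bug: the program crashes whenever both 3 and 7 appear in the input
crash = lambda xs: 3 in xs and 7 in xs
print(delta_debug(list(range(10)), crash))   # reduces to [3, 7]
```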

Dynamic Symbolic Execution

Dynamic symbolic execution is a hybrid approach to software analysis, combining dynamic and static analysis. While dynamic symbolic execution can produce false negatives, it does not produce false positives - every assertion violation it reports is real. Dynamic symbolic execution provides guided inputs to a program in an attempt to uncover inputs that traverse all paths of execution.

Motivation

The motivation behind dynamic symbolic execution is as follows:

  • Provide automated test generation
  • Generate a regression test suite
  • Execute all reachable statements
  • Catch any assertion violations

Approach

Dynamic symbolic execution's approach to software analysis is as follows:

  • Stores program state concretely and symbolically
  • Solves constraints to guide execution at branch points
  • Explores all execution paths of the unit tested

Execution paths of a program

We can envision the execution paths of a program as a binary tree with possibly infinite depth - a computation tree. Each node of the computation tree represents the execution of a conditional statement, and each edge represents the execution of a sequence of non-conditional statements.

Each path in the tree represents an equivalence class of inputs. The point of dynamic symbolic execution is to discover non-equivalent inputs to uncover new tree branches and paths.

Symbolic execution

Given a series of branches and their condition statements, instead of brute-forcing the correct value to satisfy a condition statement to take a branch, symbolic execution leverages theorem provers to determine values that satisfy symbolic constraints.

With this, we avoid using concrete values for inputs, opting for symbols, and we execute the program symbolically to resolve input values for program paths. One problem exists, however. Due to the number of possible branch conditions within a program, symbolic execution is susceptible to path explosion.

Combined approach

Dynamic symbolic execution, or concolic execution, is a hybrid approach because it leverages concrete values during its analysis while also conducting symbolic analysis of the program under test. Concolic execution starts with random input values, keeping track of both concrete values and symbolic constraints. To help prove that a branch can be taken, a dynamic symbolic execution engine will simplify its symbolic constraints using concrete values. This helps to avoid issues like path explosion; however, it also requires the engine to use an incomplete theorem prover.
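As a toy illustration of the combined approach (a hypothetical sketch, far simpler than a real concolic engine, with the branch's symbolic constraint recorded by hand), the program below is first run with a concrete input, then the recorded path constraint is negated and solved to produce an input for the other path:

```python
def program(x):
    # Unit under test: the "hard" branch is taken only when x == 42
    if x * 3 + 1 == 127:
        return "bug path"
    return "normal path"

# 1. Concrete run with an arbitrary input, recording the symbolic branch condition
#    as coefficients of the linear expression a*x + b compared against c.
concrete_x = 7
a, b, c = 3, 1, 127
took_branch = (a * concrete_x + b == c)
print(concrete_x, program(concrete_x), "branch taken:", took_branch)

# 2. Negate the path constraint of the explored path and solve it.
#    Here the constraint is linear, so "solving" is just arithmetic:
#    a*x + b == c  =>  x = (c - b) / a
new_x = (c - b) // a
print(new_x, program(new_x))   # drives execution down the other path
```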

Characteristics of DSE

The following are characteristics of dynamic symbolic execution:

  • Automated, white-box
  • Systematic input searching
  • Path-sensitive
  • Non-sampled instrumentation