Lingxiao Jiang's Research Page

Profile-Guided Program Simplification for Effective Testing and Analysis

It is a shame that so many profiling data has been collected from software users but we (software engineers) have not done enough to utilize the data for better software quality. This paper shows we can do better in this aspect by demonstrating the improved capabilities of CUTE and BLAST combined with valuable profiling information. Also, the idea of profile-guided program simplification may shed light on more purposeful, effective profiling tools that impose much less overhead on users. This work, accpeted into FSE'08, is based on many tools and data (CBI, CIL, SIR, the Aristotle Analysis System), and our previous work. [pdf]

Abstract: Many testing and analysis techniques have been developed for inhouse use. Although they are effective at discovering defects before a program is deployed, these techniques are often limited due to the complexity of real-world code and thus miss program faults. It will be the users of the program who eventually experience failures caused by the undetected faults. To take advantage of the large number of program runs carried by the users, recent work has proposed techniques to collect execution profiles from the users for developers to perform post-deployment failure analysis. However, in order to protect usersĄŻ privacy and to reduce run-time overhead, such profiles are usually not detailed enough for the developers to identify or fix the root causes of the failures.

In this paper, we propose a novel approach to utilize user execution profiles for more effective in-house testing and analysis. Our key insight is that execution profiles for program failures can be used to simplify a program, while preserving its erroneous behavior. By simplifying a program and scaling down its complexity according to its profiles, in-house testing and analysis techniques can be performed more accurately and efficiently, and pragmatically program defects that occur more often and are (arguably) more relevant to users will be given preference during failure analysis. Specifically, we adapt statistical debugging on execution profiles to predict likely failure-related code and use a syntax-directed algorithm to trim failure-irrelevant code from a program, while preserving its erroneous behavior as much as possible. We conducted case studies on a testing engine, CUTE, and a software model checker, BLAST, to evaluate our technique. We used subject programs from the Aristotle Analysis System and the Software-artifact Infrastructure Repository (SIR). Our empirical results show that using simplified programs, CUTE and BLAST find more bugs with improved accuracy and performance: they were able to detect 20 and 21 (out of 139) more bugs respectively in about half of the time as they took on the original test programs.

Back to top

Scalable Detection of Semantic Clones

This is a significant step towards "semantic-equivalent" code clone detection, which implies many interesting applications. Check it out in ICSE'08! Mark Gabel and Prof. Zhendong Su have actively been in the vanguard of this technology improvement based on program dependency graphs, over Deckard's tree-based approach, with the tool support of CodeSurfer from GrammaTech. [pdf, ppt, on ACM Portal]

Back to top

Context-Aware Statistical Debugging: From Bug Predictors to Faulty Control Flow Paths

This work, accpeted into ASE'07, improves previous bug localization techniques with fault control flow paths that may provide additional contextual information for debugging. It is based on a lightweight program instrumentation infrastructure: the Cooperative Bug Isolation Project by Prof. Ben Liblit. It also utilizes two machine learning tools: LIBSVM and RandomForests. [pdf, ps, slides.pdf, on ACM Portal]

Abstract: Effective bug localization is important for realizing automated debugging. One attractive approach is to apply statistical techniques on a collection of evaluation profiles of program properties to help localize bugs. Previous research has proposed various specialized techniques to isolate certain program predicates as bug predictors. However, because many bugs may not be directly associated with these predicates, these techniques are often ineffective in localizing bugs. Relevant control flow paths that may contain bug locations are more informative than stand-alone predicates for discovering and understanding bugs. In this paper, we propose an approach to automatically generate such faulty control flow paths that link many bug predictors together for revealing bugs. Our approach combines feature selection (to accurately select failure-related predicates as bug predictors), clustering (to group correlated predicates), and control flow graph traversal in a novel way to help generate the paths. We have evaluated our approach on code including the Siemens test suite and rhythmbox (a large music management application for GNOME). Our experiments show that the faulty control flow paths are accurate, useful for localizing many bugs, and helped to discover previously unknown errors in rhythmbox.

Back to top

Context-Based Detection of Clone-Related Bugs

The idea in this paper follows our previous work on code clone detection. We'd like to realize many potential applications of Deckard, and bug detection, which is related to our work on software reliability in general, got the first priority. Luckily, we got interesting results and were accepted into ESEC/FSE'07. We are continuously performing additional experiments to make this bug detection approach into reality. [pdf, ps, slides.pdf, on ACM Portal]

Abstract: Studies show that programs contain much similar code, commonly known as clones. One of the main reasons for introducing clones is programmers' tendency to copy and paste code to quickly duplicate functionality. We commonly believe that clones can make programs difficult to maintain and introduce subtle bugs. Although much research has proposed techniques for detecting and removing clones to improve software maintainability, little has considered how to detect latent bugs introduced by clones. In this paper, we introduce a general notion of context-based inconsistencies among clones and develop an efficient algorithm to detect such inconsistencies for locating bugs. We have implemented our algorithm and evaluated it on large open source projects including the latest versions of the Linux kernel and Eclipse. We have discovered many previously unknown bugs and programming style issues in both projects (with 57 for the Linux kernel and 38 for Eclipse). We have also categorized the bugs and style issues and noticed that they exhibit diverse characteristics and are difficult to detect with any single existing bug detection technique. We believe that our approach complements well these existing techniques.

Back to top

Deckard: Scalable and Accurate Tree-based Detection of Code Clones

This work began in Summer 05 when Prof. Zhendong Su had a visitor, Stephane Glondu, from ENS Cachan, France. Later, with Ghassan Misherghi's great help, Deckard became a practical tool and made into ICSE'07. [pdf, ps, slides.pdf, on IEEE Xplore]

Abstract: Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against minor code modifications. In this paper, we present an efficient algorithm for identifying similar subtrees and apply it to tree representations of source code. Our algorithm is based on a novel characterization of subtrees with numerical vectors in the Euclidean space R^n and an efficient algorithm to cluster these vectors w.r.t. the Euclidean distance metric. Subtrees with vectors in one cluster are considered similar. We have implemented our tree similarity algorithm as a clone detection tool called Deckard and evaluated it on large code bases written in C and Java including the Linux kernel and JDK. Our experiments show that Deckard is both scalable and accurate. It is also language independent, applicable to any language with a formally specified grammar.

Back to top

Osprey: A Practical Type System for Validating Dimensional Unit Correctness of C Programs

This work is finished with Prof. Zhendong Su and accepted into ICSE'06.
[pdf, ps, slides (pdf), refer to the portal on ACM.ORG for more; tools used in Osprey--Cqual, Banshee, CLAPACK; being improved with ROSE]

Abstract: Misuse of measurement units is a common source of errors in scientific applications, but standard type systems do not prevent such errors. Dimensional analysis in physics can be used to manually detect such errors in physical equations. It is, however, not feasible to perform such manual analysis for programs computing physical equations because of code complexity. In this paper, we present a type system to automatically detect potential errors involving measurement units. It is constraint-based: we model units as types and flow of units as constraints. However, standard type checking algorithms are not powerful enough to handle units because of their abelian group nature (e.g., being commutative, multiplicative, and associative). Our system combines techniques such as type inference and Gaussian Elimination to overcome this problem. We have implemented Osprey, a prototype of the system for C programs, and evaluated it on various test programs, including computational physics and mechanical engineering applications. Osprey discovered unknown errors in mature code; it is precise with few false positives; it is also efficient and scales to large programs---we have successfully used it to analyze programs with hundreds of thousands of lines of code.

Back to top

Automated Bug Localization with Machine Learning

This work tries to locate causes of software faults via machine learning based on the program instrumentation infrastructure -- the Cooperative Bug Isolation Project by Prof. Ben Liblit. It evolves from a class project for Machine Learning (Prof. Rao Vemuri) and is improved with Prof. Zhendong Su to a paper submitted for review.

Back to top

A Polynomial Time Solver for A Class of Integer Range Constraints

This is an implementation of the algorithm in A Class of Polynomially Solvable Range Constraints for Interval Analysis without Widenings and Narrowings by Zhendong Su and David Wagner. It is done under the guidance of Prof. Zhendong Su.

Back to top

Memory Disambiguation via Integer Range Analysis

This is a direct application of the above integer range constraint solver and conducted with Zhihong Ding, Huan Song for Computer Architecture class (Prof. Matthew Farrens). We hope it can be improved to be a polished paper and brought into practice.

Abstract: Multiple-issue and out-of-order execution microprocessors improve performance via aggressive techniques for exploiting high level instruction-level parallelism (ILP). One of the major impediments to ILP is ambiguous memory dependences. Memory disambiguation, the process of determining whether two memory instructions might access the same location, is thus valuable for high level ILP. In this paper we use a compiler-based static analysis to perform memory disambiguation. Our approach has three steps. In the first step it constructs a control flow graph (CFG) from the source assembly codes. In the second step the algorithm traverses the CFG and divide it into basic blocks. Also, it generates range constraints for every basic block. In the third step range constraints are solved by a quadratic time algorithm. The solution can then be used to help reduce stalls caused by ambiguous memory dependences. We have implemented our algorithms in C, and initial results show that for some programs a large amount of stalls can be removed.

Back to top

Practical Impact of Principle of Complete Mediation

This project is a class project done with Lynn Nguyen, Derrick Pallas and Patrick Wheeler for Computer Security (Prof. Mattew Bishop). The result is interesting but counter-intuitive. We hope we can find out why.

Abstract: One of the security principles, Complete Mediation, is often not enforced because it is believed that it would have an influence on the performance. However, our experiments show that it is not always true, at least in terms of file systems. We enforced security checks for every file operation on Linux kernel 2.4 and 2.6 via two different approaches: hijacking file system calls and using Linux Security Modules. The benckmarks we used show that the overhead of complete mediation is negligible except for few extreme circumstances.

Back to top

(More to come)

Valid XHTML 1.0! Valid CSS!