A Tracing Virtual Machine for Statistical Computing

Over the last decade, the R Project has become a key tool for Bioinformatics, Computational Biology, and clinical research. R belongs to a family of languages called dynamic languages, or scripting languages, which strive to optimize programmer time rather than machine time. Scripting languages are attractive for life science applications because they reduce programmer effort by a factor of 3 to 60 compared to traditional static languages such as Java or C++. Unfortunately, the price of this reduced effort is lower overall computational performance. Computations done entirely in R can be on average 600× slower than the same program in C. In addition to computation speed, memory requirements limit R-based analyses of large-scale genomic data sets and manipulations of complex data structures, both of which are increasingly commonplace in Bioinformatics. Part of the performance differential can be attributed to the dynamic nature of the language and the interactive nature of the R programming environment, but this is not the only reason. Even for an interpreted language R is slow: close to 100× slower than a language such as Python, which has comparable features.

This project applies some of the latest results in programming languages and compiler research to radically modernize the R execution environment and enable developers to use R without having to worry as much about performance. The project thus supports rapid development of novel software-based techniques by decreasing the cost of computationally intensive algorithms expressed in R, and it makes exploratory programming feasible for even the largest data sets by improving the memory management subsystem. This work will demonstrate the potential of modern virtual machine implementation techniques for transforming the R system into a high-performance execution environment. Instead of considering all possible use cases of R, we focus on typical applications in Bioinformatics and Computational Biology, as exemplified by packages in the R-based Bioconductor project. Our goal is to have a prototype available under a BSD-style open source license that runs a significant portion of the examples in the Bioconductor educational materials and vignettes and demonstrates an observable performance improvement (at least 100× over pure R code). To this end, we will address the following:

1. Develop methods and tools for software corpus analysis. To understand the overheads of the current implementation and identify opportunities for optimization, we have developed infrastructure and tools to empirically evaluate the entire corpus of executable code in the vignettes and educational programs of the R-based Bioconductor project and to statistically analyze its runtime behavior.

2. Implement a trace-based just-in-time compiler. We are implementing, within a new R virtual machine, a compiler that generates native code for frequently executed code fragments at run time, based on dynamically collected program traces. The performance of the compiler will be evaluated on the code corpus from Bioconductor and on proteomic and metabolomic projects in our own research.

3. Implement a pauseless, compacting, concurrent garbage collector. To reduce the memory footprint and pause times of real-world R programs, we will design and implement a fragmentation-tolerant concurrent garbage collection algorithm. The performance of the implementation will be evaluated on the code corpus from Bioconductor and on proteomic and metabolomic projects in our own research.

4. Add gradual typing. We propose to extend the language with optional gradual type annotations that developers can use to communicate static software constraints, which the compiler can then exploit to generate more efficient code.
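
As a rough illustration of the idea behind aim 2, the sketch below shows hot-loop detection followed by specialization of the loop body into straight-line code protected by a type guard. It is a toy model written in Python, not ReactoR's design: the opcode table, the `HOT_THRESHOLD` value, and the function names `interpret`, `compile_trace`, and `run_loop` are all assumptions made for this example, and Python source generation stands in for native code generation.

```python
import operator

HOT_THRESHOLD = 3  # assumed value: loop iterations before the body counts as hot

# Toy opcode table (hypothetical; a real R VM has a far richer instruction set).
INTERP = {"add": operator.add, "mul": operator.mul}
SYMBOL = {"add": "+", "mul": "*"}


def interpret(body, acc):
    """Slow path: dispatch on every opcode, as a plain interpreter would."""
    for op, arg in body:
        acc = INTERP[op](acc, arg)
    return acc


def compile_trace(trace, seen_type):
    """Turn a recorded trace into one straight-line Python function.

    The function is specialized to the accumulator type seen while recording.
    It returns (True, result) when the guard holds and (False, acc) otherwise,
    so the caller can fall back to the interpreter.
    """
    lines = [
        "def _trace(acc):",
        "    if type(acc) is not seen_type:",  # guard on the recorded type
        "        return False, acc",
    ]
    for op, arg in trace:
        lines.append(f"    acc = acc {SYMBOL[op]} {arg!r}")
    lines.append("    return True, acc")
    namespace = {"seen_type": seen_type}
    exec("\n".join(lines), namespace)
    return namespace["_trace"]


def run_loop(body, acc, n_iters):
    """Run the loop, switching to the compiled trace once the loop is hot.

    Returns the final accumulator and the number of iterations that ran
    through the compiled trace instead of the interpreter.
    """
    compiled = None
    fast_iters = 0
    for i in range(n_iters):
        if compiled is not None:
            ok, acc = compiled(acc)
            if ok:
                fast_iters += 1
                continue
        # Either not hot yet or the guard failed: interpret this iteration.
        acc = interpret(body, acc)
        if compiled is None and i + 1 >= HOT_THRESHOLD:
            compiled = compile_trace(body, type(acc))
    return acc, fast_iters
```

With the body `[("add", 1), ("mul", 2)]`, a starting value of `0`, and five iterations, the first three iterations are interpreted and the remaining two run through the compiled trace. A production tracing compiler would also handle branches, side exits, and trace trees, all of which this sketch leaves out.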

The prototype implemented within this project, called ReactoR, aims to fully integrate the new execution environment with the legacy R development environment and libraries.
