This group hosts research into the use of high performance computing (HPC) for primary genomics analyses, such as alignment, variant calling, genome assembly, and RNA-Seq. By its nature, this research is highly collaborative. Every member of our team is affiliated with multiple departments or campus initiatives. The student participants in this group serve as a bridge between the campus faculty who use computational genomics analyses in their research and the NCSA experts in HPC, storage, networking, databases, and related areas. Together we enable the use of advanced computing infrastructure in computational genomics. Explore this page to find out who is involved, how we are connected, and what projects are currently ongoing.
NCSA Press: Crossing over, branching out: Meet the NCSA Genomics team
Technical Program Manager, National Center for Supercomputing Applications
Research Assistant Professor, Institute of Genomic Biology
217-300-0568
NCSA Genomics, September 2017. Credit: Steve Duensing
Current People and Projects | ||
Ramshankar Venkatakrishnan, Research Programmer B.S. Electronics & Communications (2012) M.S. Electrical & Computer Engineering (2015) | Mayo Grand Challenge: evaluating and streamlining genomics workflows. Ramshankar is working on computational improvements for the Mayo Grand Challenge, a genomics research Ram will also contribute his hardware expertise to the project, evaluating system architecture options to complement the team’s | |
Katherine Kendig, Associate Project Manager B.A. Anthropology (2012) M.F.A. Creative Writing (2017) | Project Management. Katherine is a project manager with the NCSA Industry Program, working primarily with biomedical partners. She benchmarked the Sentieon variant calling software for the Mayo Grand Challenge (https://www.biorxiv.org/content/10.1101/396325v1). She has also contributed to NCSA’s Public Affairs team, writing articles about NCSA and XSEDE research: After the storm; Bringing supercomputing to psychology; DISSCO Tech; ECSS: Profiles in Consulting; NCSA Genomics; History was here | |
Brian Bliss, Research Programmer | Data compression. Brian will be working on data compression for the Mayo Grand Challenge project. | |
Dan Lanier, Research Programmer B.S. Applied Mathematics (2008) | NCSA Industry. Dan supports biomedical partners in the NCSA Industry program. Dan provides a complementary mix of expertise in HPC and mathematical data analysis to enable pharmaceutical, agricultural and medical companies to utilize the high performance computing resources at NCSA. | |
Weihao Ge B.S. Physics (2008) M.S. Physics (2011) Ph.D. Biophysics (2018) advised by Dr. Eric Jakobsson | Search Space Reduction. Weihao is evaluating statistical methods for search space reduction in the analysis of GWAS data for genomic variant Her work is part of the CCBGM project "Scaling the Computation of Epistatic Interactions in GWAS Data." | |
B.S. Crop Science (2016) Plant Biotechnology, Molecular Biology M.S. Bioinformatics (2018) Department of Crop Sciences, UIUC Graduate Fellow in the College of ACES advised by Dr. Matthew Hudson | Genomic variant calling by assembly. Mr. K is focusing on a method to detect genomic variants by assembly. He is employing the software Cortex-var, which constructs de-novo genome assembly on multiple Mr. K is also working with Tiffany on the genomic analysis of HLHS for the Mayo Grand Challenge. Poster: Variant Calling by Assembly. Poster: Reference-guided variant calling for non-repetitive sequences in Glycine max | |
Brian Rao B.S. Integrative Biology (2018) Minor in Informatics | Brian writes and tests the variant calling workflow code for the Mayo Grand Challenge. He is focusing on the accuracy and performance considerations of tumor variant detection in clinical settings. | |
Graduate Students | ||
Prakruthi Burra B. E. Computer Science (2018) M.S. Biological Sciences (2018) | Workflow management for variant calling. Prakruthi is implementing a variant calling workflow in Nextflow (a workflow manager). She is also in charge of testing the workflow developed for the Mayo Grand Challenge before delivery. Human Heredity & Health in Africa. Prakruthi will be contributing to UIUC's work with the H3Africa Consortium. | |
Dave Istanto B.S. Crop Sciences (2018) | Workflow management for structural variant calling. Dave is creating a Nextflow workflow for structural variant calling using Cortex-var. | |
Undergraduate Students | ||
Dipro Ray B.S. Computer Science (2020) Minor in Mathematics | Resolving Racial Disparities by Applying Statistics on Complex, Multidimensional Datasets. Dipro is turning a proof-of-concept statistical pipeline for analyzing health data into a well-structured, portable open source package that is containerized and deployable on cloud platforms such as AWS, so that researchers and collaborators can run this critical software with only a few commands. In pursuit of this goal, Dipro is also refining the statistical pipeline in a modular manner, working out key design decisions for its implementation, and improving the package's computational efficiency by making use of the host computer's architecture and resources. | |
This project aims to deploy variant calling workflows implemented using systems such as WDL and Nextflow in AWS and other cloud services. | ||
High School Students. We have several high school students working with our team to gain skills and complete projects in a real-world environment. | ||
Sophia Torrellas | Sophia and Angelynn are benchmarking the performance and accuracy of Minimap2 (Li, 2018), which maps sequencing reads against the reference genome for the species. | |
Angelynn Huang |
Former Group Members | ||
Ellen Nie B.S. Computer Science (2018) | Big data network transfers for genomics. Ellen is benchmarking the network transfers of genomic data across multiple sites. She wants to understand the limitations of modern network backbones for big data genomics, and to facilitate correct configuration of the endpoints to resolve those limitations. Ellen is looking at the sites of our collaborators in Toronto, South Africa, Sudan, and the UK. Poster: Benchmarking and Optimization of Long Distance Big Data Transfers. Validation of Sentieon - the fast alternative to GATK. Ellen is also collaborating with OICR to validate the speed and accuracy of the new software package for genomic variant calling, called Sentieon DNASeq. Convert Java-based GWAS code for Spark. In a project described below (Accurate and scalable GWAS algorithms) we are improving the performance of a stepwise epistatic model selection for Genome-Wide Association Studies. The method itself works well, but the current Java implementation is far too slow for modern data sizes. We would like to deploy this Java code on Spark, to see if the necessary performance gains could be obtained. A successful student applicant will use the Java Spark API to adapt the current code for a Spark platform that is being deployed at NCSA ISL2.0. This code will be validated for correctness in collaboration with a student statistician from the lab of Dr. Lipka, who developed this statistical method. Poster: Scaling the Computation of Epistatic Interactions in GWAS Data | |
Tiffany Li B.S. Integrative Biology (2018) minor in Computer Science | Benchmarking performance and accuracy of genomic variant calling software. Tiffany collaborates to document our efforts in benchmarking variant calling on HPC systems. We have run variant calling experiments on 500 genomes in parallel, on Blue Waters, to identify performance bottlenecks when using the GATK best practices workflow. We have also tested a number of alternative software packages, such as Isaac, Genalice, and Sentieon, as well as Dragen, a hardware solution. Tiffany is documenting the pros and cons of each of these excellent approaches in a separate manuscript. Validation and benchmarking of ParFu - a parallel file packaging utility. Tiffany is also involved in testing and benchmarking ParFu, an MPI tool for creating or extracting directory tree archives. ParFu was written by Dr. Craig Steffen, who works in the Blue Waters team. | |
Sijia Huo B.S. Mathematics & Computer Science (2018) second major in Statistics third major in Economics | Parallelization of R code. Sijia is working with NCSA Faculty Fellow Dr. Zeynep Madak-Erdogan to introduce parallel R code into her research. Dr. Madak-Erdogan is exploring racial disparities in breast cancer occurrence through the lens of diet and nutrition. | |
Ryan Chui B.S. Biochemistry (2016) M.S. Bioinformatics (2017) | NCSA Industry. Ryan performed software installation, benchmarking, and development for a variety of industry partners. To investigate how the training time for deep neural networks (DNNs) can be affected, Ryan worked with TensorFlow, Google’s deep learning library, to perform multi-label classification on a data set. He built an autoencoder, an unsupervised deep neural network, to extract salient features from the On GitHub: EpiQuant: Hadoop, C, TensorFlow - epistasis software prototypes; MLCC - multi-label cancer classification; q2b - binary representation of nucleotides; ptgz - parallel tar gzip; Usage Analyzer - log analyzer for HPC schedulers | |
Jennie Zermeno B.S. Integrative Biology (2017) | Benchmarking performance and accuracy of genomic variant calling software. Jennie collaborated to document our efforts in benchmarking variant calling on HPC systems. Jennie also participated in the debugging of the H3ABioNet GATK Germline Workflow. Bioinformatics in the Cloud. Jennie is investigating the issues of portability, reproducibility and scaling of bioinformatics workflows in cloud infrastructure by instantiating containerized versions of workflows. Students Capitalize on Computational Genomics Research Using AWS | |
Angela Chen M.S. Statistics (2017) Department of Statistics, UIUC CompGen fellow advised by Dr. Alexander Lipka | Accurate and scalable GWAS algorithms. Angela and Khory collaborated to improve the scalability and parallelization of the statistical software TASSEL5, widely used for conducting genome-wide association studies (GWAS) in plants. Angela wrote a manuscript to demonstrate that her new stepwise epistatic model selection procedure has greater statistical power compared to other methods. However, the Java-based TASSEL5 cannot be easily parallelized across multiple nodes in a computational cluster, to run on Khory provided the expertise in computer science to convert this Java code into C++ and parallelize it in an HPC environment. | |
Khory Wagner advised by Dr. Volodymyr Kindratenko | ||
Nainika Roy B.S. Molecular and Cellular Biology (2017) minor in Informatics and Chemistry SPIN fellow | Data formats and data structures in computational genomics | |
Junyu Li B.S. Molecular and Cellular Biology (2017) minor in Computer Science SPIN fellow | Genomic variant calling by assembly. Junyu worked with Mr. K in an interdisciplinary team, providing the expertise in math and computer science to automate the Cortex-var workflow and interpret the algorithm. Poster: Reference-guided variant calling for novel non-repetitive sequences in Glycine max | |
Noah Flynn B.S. Bioengineering, Mathematics (2017) minor in computer science SPIN fellow | Evolution of molecular networks and persistence of organisms | |
Jacob Heldenbrand, Research Programmer B.S. Biochemistry (2014) M.S. Bioinformatics (2016) | NCSA Industry. Jacob supports biomedical partners in the NCSA Industry program. Jacob provides a complementary mix of expertise in HPC and bioinformatics data analysis to enable pharmaceutical, agricultural and medical companies to utilize the high performance computing resources at NCSA. Jacob and Azza Ahmed (Ph.D. candidate, University of Khartoum) are exploring and evaluating the | |
B.S. Molecular and Cellular Biology (2016) M.S. Bioinformatics (2018) Department of Crop Sciences, UIUC CompGen fellow advised by Dr. Matthew Hudson | Mutation profiles of cancer. Mr. Weber is developing machine learning methods to effectively stratify cancers based on the statistical properties of mutations found in afflicted individuals. Cancer stratification is predictive of disease outcomes, drug response and drug metabolism. Effective computational approaches based on total data acquired to date can make this process cheaper in the clinic. Matt collaborates with the Ontario Institute for Cancer Research to make sure his models are realistic. Paper: Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models. Poster: Statistical models to capture mutational properties for NextGen Sequencing Data | |
Aishwarya Raj B.S. Biochemistry (2019) minor in Bioinformatics | Evolution of molecular networks and persistence of organisms. Construct and compare gene, metabolic and signaling networks from organisms across the tree of life. The goal of the project is to provide support for the general framework of persistence strategies. It postulates that persistence is achieved by biological systems via a tradeoff of traits that serve either economy, flexibility, or robustness. In this project we want to determine and quantify the molecular mechanisms that underlie these persistence strategies. Will analysis of the biomolecular networks allow us to differentiate between organisms of differing economy, flexibility, and robustness, and subsequently classify unknown, newly discovered, or modified organisms within such predefined Poster: Persistence Strategies in Biomolecular Network Architecture. NCUR Slides: Architecture and Dynamics of Biomolecular Networks Facilitate Evolution of Persistence Strategies in Living Organisms | |
Cynthia Liu B.S. Bioengineering (2019) minor in Computer Science | Workflow management comparisons. Cynthia worked to learn the Nextflow system for workflow management and to compare and contrast
Poster: Comparative Analysis of Genomic Sequencing Workflow Management Systems |
Other Collaborations
Dr. Matthew Hudson Bioinformatics Crop Science | HPCBio, Carver Biotechnology Center | |
Dan Wickland Ph.D. Informatics (2019) | ||
Computer Science | NCSA Scientific Software and Applications. Portable variant calling workflow in Swift
| |
Azza Ahmed Computer Science advised by Dr. Faisal Fadlelmola | ||
Dr. Zeynep Madak-Erdogan Food Science & Human Nutrition | Madak-Erdogan Lab. Systems Biology of Estrogen Signaling
| |
Brandi Smith Ph.D. Food Science and Human Nutrition (2021) | ||
H3Africa Consortium
| ||
Morgan Taschuk Bioinformatics | OICR
| |
Paul Hatton HPC / Visualisation
| University of Birmingham |
...