
Welcome to the Digital Transformation Institute!

You have been given a grant as part of the new Digital Transformation Institute (DTI)!
To make the start of your DTI experience as fast as possible, we have assembled a set of resources to:

  1. Introduce researchers of all stripes to the system
  2. Help researchers determine what level of training they will need to leverage the platform's resources
  3. Point researchers directly to relevant documentation they will need
  4. Provide worked examples of different research workflows and how they may be ported into
    the environment, or may use the platform's resources

If you have questions not covered by this guide, please contact the DTI team by email.

Introduction to the platform

The platform is a data analytics engine designed to make the ingestion and analysis of heterogeneous data sources as painless as possible. It joins data from multiple sources into a single unified federated data image. With the federated data image defined, the platform provides an API to access that data and, in the case of time-series data, to perform numerous transformations and computations, all producing normalized time-series data at regular intervals.

The platform also supports R and Python Jupyter notebook analysis of the federated data image. These notebooks provide a great way for researchers to analyze data close to where the data is stored. While the platform supports many data science capabilities familiar to the researcher, some expected functionality may be missing. For these cases, the platform supports implementing new data processing functions in Python and JavaScript.
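To illustrate what "normalized time-series data at regular intervals" can mean in practice, here is a minimal standard-library sketch that averages irregularly timed readings into fixed-width buckets. It is not the platform's implementation or API; the function name `normalize_to_interval` and the sample readings are hypothetical and exist only for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def normalize_to_interval(readings, start, interval):
    """Average irregular (timestamp, value) readings into fixed-width buckets.

    Hypothetical helper for illustration only. Returns a time-ordered list of
    (bucket_start, mean_value) pairs. Readings before `start` are ignored and
    buckets that received no readings are omitted.
    """
    buckets = defaultdict(list)
    for timestamp, value in readings:
        if timestamp < start:
            continue
        # Integer index of the bucket this reading falls into.
        index = (timestamp - start) // interval
        buckets[index].append(value)
    return [
        (start + index * interval, sum(values) / len(values))
        for index, values in sorted(buckets.items())
    ]


# Made-up readings taken at irregular times.
readings = [
    (datetime(2020, 3, 1, 0, 5), 10.0),
    (datetime(2020, 3, 1, 0, 40), 14.0),
    (datetime(2020, 3, 1, 1, 15), 20.0),
]
start = datetime(2020, 3, 1)
interval = timedelta(hours=1)

for bucket_start, mean_value in normalize_to_interval(readings, start, interval):
    print(bucket_start.isoformat(), mean_value)
```

Averaging is only one possible treatment; sums, last-value carry-forward, or interpolation may suit other kinds of data better.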

As with any other API, porting your own workflows will take some care and time to learn properly. Please use this guide to make understanding the platform and porting your workflow as quick and easy as possible.

Services available on the platform

  • COVID-19 Datalake: This unified federated Datalake includes data from numerous sources.
  • computing platform
  • Integrated Development Studio
  • Jupyter notebooks
  • Marketplace
  • UI system for creating dashboards

How does the platform differ from traditional HPC systems?

  • Traditional HPC systems are similar to Hardware as a Service (HaaS), while the platform is more like a Platform as a Service (PaaS).
    Users are encouraged to work within the platform's API to achieve the best performance.
  • The platform offers a state-of-the-art data integration system as the basis for all Data Science operations.
    This is in contrast to HPC systems, where all components of data management and the analysis pipeline must be installed and managed independently.

What types of software can be run on the platform?

  • Nearly any Python module may be installed and used through pip or conda.
  • Nearly any R package may be installed and used within the R Jupyter environment.

What types of software cannot be run on the platform?

  • General binary executables are not supported by the platform out of the box.
  • MPI-based Python software
  • Packages which must be built from scratch on the platform, or which require specific hardware drivers
  • Python modules which require specially built binaries may also fail to run.
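Since some modules install cleanly but still fail to load when they depend on specially built binaries, a quick import check can show which parts of a workflow are usable in a given environment. The following is a minimal sketch using only the Python standard library; the helper name `check_modules` is hypothetical and is not part of the platform.

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it imports, else False.

    Hypothetical helper for illustration only. ImportError covers missing
    modules; OSError covers some failures to load compiled extensions.
    """
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except (ImportError, OSError):
            results[name] = False
    return results


# "json" ships with Python; the second name is deliberately nonexistent.
status = check_modules(["json", "surely_not_a_real_module_xyz"])
for name, ok in status.items():
    print(name, "importable" if ok else "NOT importable")
```

Running a check like this inside a notebook on the training environment, with the list of modules your workflow needs, can reveal problems before an allocation begins.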

How do I get started?

Use this guide to determine what training you need to utilize the platform's resources effectively. We have identified four
categories of usage of the platform. For each, we include basic examples of workflows which might fall into that level,
pros and cons of operating on that level, and a list of training resources we recommend researchers
complete on the DTI training environment before starting their allocations. This will ensure researchers are
able to use their allocations as efficiently as possible.

Examine the high-level overviews of each level below, then click the section titles to go to more in-depth
discussions related to that level, such as the recommended training.

Level 1: Use COVID-19 Datalake Only

Many researchers will simply want to leverage the COVID-19 Federated Data Image.

Pros:

  • Easy to integrate into existing scientific workflows and run on existing scientific computational hardware
  • Publicly available API means no credentials are needed to access the data
  • Assuming you have access to your own computational resources, you don't have to worry about allocations
    on the platform.

Cons:

  • All data used from the Datalake must be streamed to wherever you're processing data
  • The performance benefits of working with the Datalake on the platform will not be available.
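Because data from the Datalake must be streamed to wherever it is processed, it can help to pull records page by page rather than loading everything at once. The sketch below shows that general pattern only. This guide does not describe the Datalake API's actual endpoints or pagination parameters, so `fetch_page`, `offset`, and `page_size` are hypothetical stand-ins, and the example runs against fake in-memory data instead of the network.

```python
from typing import Callable, Dict, Iterator, List


def stream_records(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 100,
) -> Iterator[Dict]:
    """Yield records one at a time, requesting them in pages.

    `fetch_page(offset, limit)` is a hypothetical callable that returns up to
    `limit` records starting at `offset`. A short (or empty) page is treated
    as the end of the data.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        for record in page:
            yield record
        if len(page) < page_size:
            return
        offset += len(page)


# Fake data source standing in for a remote API (an assumption for the demo).
fake_rows = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return fake_rows[offset:offset + limit]


total = sum(1 for _ in stream_records(fake_fetch, page_size=100))
print("records streamed:", total)
```

In a real workflow, `fake_fetch` would be replaced by a function that calls the Datalake API as described in its own documentation.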

Level 2: GUI-based data analysis on the platform

The platform provides a GUI-based interface to the system through its Integrated Development Studio. Such
an environment is likely to be attractive to many researchers. This level is the easiest way to integrate new
data onto the Datalake.

Pros:

  • GUI to manage Types and data integration.
  • GUI to piece together ML pipelines.
  • Ability to load new data onto the Datalake

Cons:

  • Some types of workflows may not be easily defined within the GUI framework

Level 3: Utilize the AI Suite and Jupyter notebook analysis

Some researchers will want to write their own package and leverage more of the AI Suite through Jupyter notebooks. The platform allows researchers to define their own types and methods, and to use R and Python to perform analysis on Datalake data.

Pros:

  • Researchers can use a Jupyter notebook to interface directly with their Data Model.
  • Researchers can often use exactly the same workflow they were using before.

Cons:

  • We recommend users take a full set of training to completely familiarize themselves
    with the system before embarking on their analysis.

Level 4: State-of-the-art ML workflows requiring special ML models and/or GPUs

Some researchers will want to bring state-of-the-art ML workflows to the platform. The platform can support such workflows, but
extra work may be needed.

Pros:

  • Researchers can bring state-of-the-art workflows close to the COVID-19 Datalake

Cons:

  • The DTI team will evaluate on a case-by-case basis whether a workflow is appropriate for the platform.
  • Some workflows may require major effort to fit within the framework.


This section introduces the process to access the platform. Generally speaking, once you receive your grant,
the DTI team will reach out and discuss with you what your needs are.

  1. Determine which researchers will require access to a platform environment.
  2. Each researcher will be given a developer portal login.
  3. Each researcher will be given a tag on the DTI training cluster.
  4. Once training is complete, discuss with the DTI team what your needs
    for a cluster will be.
  5. The DTI will work with the platform provider to stand up a new tag for your research.
  6. Access to that tag will be granted to your researchers.
  7. Research can then proceed until your allocation is exhausted!

Essential Concepts

The platform is quite different from traditional HPC resources. We have written an introduction to the platform from the
perspective of a scientific researcher. We go over several important concepts and relate them to
what scientists are more familiar with.

Allocation Management

This section introduces how researchers will be expected to manage their allocations while on the platform.

This section will be expanded once the DTI team understands how this procedure will look to the researcher.

Special Compute Resource Information

Here you can find information about the special compute resources available to DTI researchers.

Comprehensive List of Available Training and Resources

See the above link for a comprehensive list and categorization of the available training
materials. This includes documentation, DTI introductions, and DTI-created examples and exercises.

Help! This guide doesn't solve my problem!

No problem! You're not alone! Please send an email to the DTI team with a description of your issue
and one of our team members will work with you to resolve it.


If you feel aspects of this guide are incomplete or inaccurate, please send an email to the DTI team with the
issue or suggestion, and we will work to incorporate it to make the documentation better. We appreciate the new perspective
more eyes can bring to a software project!

Your DTI Team


Jay Roloff - Executive Director

Matthew Krafczyk - Data Analyst

Yifang Zhang - Data Analyst


Larry Rohrbach - Executive Director

Eric Fraser

Greg Merritt

Matt Podolsky
