November 16, 2020 - Workflows Best Practices 10:00 am

We will continue the discussion started in the 2020-10-26 SWG meeting notes.

Attendees:

Goals

Questions

  • Systems currently leveraged and developed
    • Datawolf, Parsl, Airflow, YesWorkflow, Galaxy, LSST??, WDL/Cromwell/miniwdl (on top of SLURM and HTCondor) (genomics group), bash scripts
  • Who are the systems for?
    • Airflow doesn’t have a polished UI for assembling steps; workflows are defined entirely in code.
    • Galaxy and the command line offer good UX
    • Datawolf became an easy tool for developers
    • Boxes and arrows for scientists
    • Wrap existing tools as web services
    • Concurrency (services) / Parallelism (threads) / scaling out on different underlying resources
      • In Python and R this is particularly needed
        • Parsl helps with concurrency in Python
      • Makes it easy to overwhelm the system. Naive users might misuse the tool.
      • The developers provide ongoing support once the tool is deployed: scientists give us code, and we need to give them guidance on how best to implement it as a workflow (e.g. on the campus cluster)
    • Abstract underlying clusters
    • Portability: transition running a pipeline from one cluster to another
  • Interoperability between workflow systems
    • https://www.commonwl.org/ (Common Workflow Language)
    • Do we use any tools that currently support CWL?
      • NOAA uses it for GFS forecasting 
      • National Cancer Institute
        • Seven Bridges (cancer genomic platform) 
        • Knoweng moved some of the pipeline to Seven Bridges
  • Share, reuse, reproduce. How often is this done in practice? If not often, why not?
    • How is this different from a package?
    • YesWorkflow (part of ?). They have a dashboard of other people's workflows.
    • Galaxy
    • Weka, RapidMiner, D2K, KNIME: data mining workflows, each tied to a specific tool
    • Packaging of tools vs workflow management
      • Docker has helped with some of this
  • Can collaborators (e.g. grad students in the sciences) realistically build and maintain the workflow?
    • How much help do they need?
    • Is it realistic to leave them with a system to maintain, versus a bash script?
    • Knoweng had a professional development effort to wrap the tool: a team of undergrads supervised by Charles Blatti. The learning curve could be steep but not insurmountable.
    • Datawolf: grad students created some tools in the past. Students can experiment with creating new workflows, while we maintain the production workflows. Students were not allowed to delete workflows.
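
Illustration for the concurrency point above (easy parallelism makes it easy to overwhelm a shared system). This is a minimal sketch using only the Python standard library, not Parsl or any of the systems listed in these notes. The names run_bounded, ConcurrencyProbe, and square are hypothetical and exist only for this example. It shows one way a workflow wrapper could cap how many steps run at once so that a naive user cannot flood the underlying resource.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(steps, max_workers=4):
    """Run zero-argument callables concurrently, never more than
    max_workers at once. Results come back in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(step) for step in steps]
        return [future.result() for future in futures]


class ConcurrencyProbe:
    """Records the highest number of steps seen running at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def wrap(self, func, *args):
        """Return a zero-argument step that calls func(*args) and is counted."""
        def step():
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return func(*args)
            finally:
                with self._lock:
                    self._active -= 1
        return step


def square(x):
    time.sleep(0.01)  # stand-in for a real workflow step
    return x * x


# Submit 20 steps, but allow at most 3 to run at any moment.
probe = ConcurrencyProbe()
results = run_bounded([probe.wrap(square, i) for i in range(20)], max_workers=3)
print(results[:5], "peak concurrency:", probe.peak)
```

The cap lives in the wrapper rather than in the scientist's code, which matches the support model discussed above: developers choose a safe limit for the target cluster, and users submit as many steps as they like.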

Potential Focus Groups

  • Focus Group #1 - Categorization of actively used scientific workflow management systems
    • Dimensions
      • User facing / Developer facing
      • Maturity and Stability
      • In house expertise
      • Why it was picked in the first place
      • Preferred by certain funding agencies / companies
      • Community / developer support
      • Support for run history / job management / statistics
      • CWL / WDL support (these are not the execution environments)
      • How generic is the back end? How portable is the definition language itself?
  • Focus Group #2 - How to advise a researcher or project on what to use
    • What questions to ask when gathering requirements
    • Advantages / disadvantages of adopting something that you didn't build