November 16, 2020 - Workflows Best Practices 10:00 am
We will continue the discussion started in the 2020-10-26 SWG meeting notes.
Attendees:
- Luigi Marini
- Jong Lee
- Aaraj Habib
- Camille Goudeseune
- Charles Blatti
- Christopher Navarro
- Gregory Bauer
- Michael Bobak
- Kaveh Karimi Asli
- Matt Berry
- Michal Ondrejcek
- Michelle Butler
- Michelle Gower
- Sara Lambert
- Mikolaj Kowalik
- Peter Groves
- Roland Haas
- Sandeep Puthanveetil Satheesan
- Timothy Andrew Manning
- Vara Veera Gowtham Naraharisetty
- Chen Wang
- Yong Wook Kim
- Elizabeth Yanello
Goals
- Come up with topics for follow-up discussions and focus groups (FG)
- Research topics vs Tools ready to be used
- So many workflow systems
Questions
- Systems currently leveraged and developed
- Datawolf, Parsl, Airflow, YesWorkflow, Galaxy, LSST??, WDL/Cromwell/miniwdl (on top of SLURM and HTCondor) (genomics group), bash scripts
Who are the systems for?
- Airflow doesn't have a pretty UI for assembling steps. Workflows are all code.
- Galaxy and the command line offer good UX
- Datawolf became an easy tool for developers
- Boxes and arrows for scientists
- Wrap existing tools as web services
- Concurrency (services) / Parallelism (threads) / scaling out on different underlying resources
- In Python and R this is particularly needed
- Parsl helps with concurrency in Python
- Makes it easy to overwhelm the system. Naive users might misuse the tool.
- The developers provide ongoing support once the tool is deployed. Scientists give us code, and we need to give them guidance on how best to implement it as a workflow (e.g. on the campus cluster)
- Abstract underlying clusters
- Portability: transition running a pipeline from one cluster to another
- Interoperability between workflow systems
- https://www.commonwl.org/ (Common Workflow Language)
- Do we use any tools that currently support CWL?
- NOAA uses it for GFS forecasting
- National Cancer Institute
- Seven Bridges (cancer genomics platform)
- Knoweng moved some of the pipeline to Seven Bridges
- Share, reuse, reproduce. How often is this done in practice? If not often, why not?
- How is this different from a package?
- YesWorkflow (part of ?). They have a dashboard of other people's workflows.
- Galaxy
- Weka, RapidMiner, D2K, KNIME: data mining workflows, tool-specific
- Packaging of tools vs workflow management
- Docker has helped with some of this
- Can collaborators (e.g. grad students in the sciences) realistically build and maintain the workflow?
- How much help do they need?
- Is it realistic to leave them with a system to maintain vs a bash script?
- Knoweng used professional developers to wrap the tool. Team of undergrads supervised by Charles Blatti. The learning curve can be steep but is not insurmountable.
- Datawolf: grad students created some tools in the past. Students can experiment with creating new workflows. We maintain the production workflows and didn't allow students to delete workflows.
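A minimal sketch of the "easy to overwhelm the system" concern noted under concurrency above. It uses Python's standard-library concurrent.futures as a stand-in, not Parsl's API. The names run_step, run_all and MAX_WORKERS and the simulated work are hypothetical placeholders, not something presented at the meeting. The point is only that a fixed worker cap keeps a naive "submit everything" loop from flooding a shared resource.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # hypothetical cap; in practice set per cluster policy

_lock = threading.Lock()
_running = 0
peak_running = 0  # highest number of steps observed running at once


def run_step(sample_id):
    """Stand-in for one workflow step (e.g. a wrapped analysis script)."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    try:
        time.sleep(0.01)  # simulated work
        return sample_id * 2
    finally:
        with _lock:
            _running -= 1


def run_all(sample_ids, max_workers=MAX_WORKERS):
    # The pool size is the throttle: at most max_workers steps run
    # concurrently, however many tasks are submitted.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_step, sample_ids))


results = run_all(range(20))
```

Twenty tasks are submitted, but peak_running never exceeds MAX_WORKERS. Workflow systems generally expose a comparable limit in their executor or scheduler configuration, and choosing that limit is part of the guidance developers would give to scientists.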
Potential Focus Groups
- Focus Group #1 - Categorization of active and used scientific workflow management systems
- Dimensions
- User facing / Developer facing
- Maturity and Stability
- In house expertise
- Why it was picked in the first place
- Preferred by certain funding agencies / companies
- Community / developer support
- Support for run history / job management / statistics
- CWL / WDL support (these are not the execution environments)
- How generic is the back end? How portable is the definition language itself?
- Focus Group #2 - How to advise researchers/projects on what to use
- What questions to ask when gathering requirements
- Advantages / disadvantages of adopting something that you didn't build