Open discussions on specific topics chosen by the Software Working Group from the list of SWG Topics For Discussion.
Tuesday, June 14, 2022: Data Science Community Group, moderated by John MacMullen
Slides:
Recording:
Attendees:
Discussion:
See the recording and slides for in-depth content.
Many members of the Software Directorate work with data scientists, so this topic is relevant to all of us at NCSA. John MacMullen (MBDH) thanked everyone for joining.
NCSA Data Science Community Building was a project started with the help of Bill Gropp, Colleen Bushell and John MacMullen. Shannon Bradley is the PM for this project.
How does data science affect NCSA and the community in general?
Data Science is whatever you think it is, and many more things. What does data science mean to you?
Many of the projects in SWG deal with data science, and we help to create generic forms, but getting projects to collaborate center-wide would help not only NCSA but also all of the projects that we support.
Having tool kits for data management and data science would be beneficial to all.
Create definitions for data science, data engineering, data management. This is complicated.
Workforce development - data carpentry, scientific software development, pipelines, computing
The software engineers need to learn the languages and tools that the researchers are familiar with, such as R, Python, and Jupyter Notebooks.
Discussion of correlating dataset and data management
Challenge of doing computations across datasets when the actual data cannot be made available, for example data protected under HIPAA. When we have sensitive data that cannot be accessed directly, there are secure ways to query the datasets without actually seeing the data. NCSA would benefit from this capability, which would help us with industry partners and health providers as well as campus partners.
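One way to picture "running a query without seeing the data" is an aggregate-only interface that returns summary statistics and suppresses small groups. The notes do not say which approach was discussed, so the following is only a minimal sketch: the function name `aggregate_query`, the threshold `MIN_GROUP_SIZE`, and the synthetic records are all hypothetical, and this alone is not a HIPAA compliance mechanism.

```python
from statistics import mean

# Hypothetical threshold: groups smaller than this are suppressed.
MIN_GROUP_SIZE = 5


def aggregate_query(records, group_key, value_key, min_group_size=MIN_GROUP_SIZE):
    """Return per-group count and mean only; the underlying rows are never returned."""
    groups = {}
    for row in records:
        groups.setdefault(row[group_key], []).append(row[value_key])

    results = {}
    for group, values in groups.items():
        if len(values) < min_group_size:
            # Too few records: withhold the statistics for this group.
            results[group] = {"count": None, "mean": None, "suppressed": True}
        else:
            results[group] = {
                "count": len(values),
                "mean": mean(values),
                "suppressed": False,
            }
    return results


# Synthetic stand-in for a protected dataset (not real data).
protected_records = (
    [{"site": "A", "length_of_stay": d} for d in (2, 3, 4, 5, 6)]
    + [{"site": "B", "length_of_stay": d} for d in (7, 9)]
)

summary = aggregate_query(protected_records, "site", "length_of_stay")
print(summary)
```

In this sketch the caller only ever receives `summary`; site "B" has two records, so its statistics are withheld rather than reported.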
Goal: Present NCSA as the leader on campus as a data service provider
Goal: develop an NCSA community for sharing
Goal: after getting the NCSA community together, add the campus community and external collaborators.
Next Steps: What would getting the NCSA community together to share projects, data, etc. look like? Will there be a metadata standard for formats? That would definitely help in sharing data.
Kastan notes RE transforming our internal data products to be more compatible: https://www.getdbt.com/
Ana notes it would be helpful to have a group for internal data sharing
Best Practices Group for Data Science and Data Management.
SD is developing a Best Practices Handbook
There is a Slack Channel:
NCSA Slack channel #data-science-coordination
Sharing through GitHub, Jupyter Notebooks, and Slack, rather than the wiki
Who is responsible for maintaining datasets?
Accessing data can be tricky!
As part of MBDH, there are carpentry workshops, and MBDH is a member of the carpentry community. They also run train-the-trainer workshops. NCSA could develop skills at these workshops. Would Software Directorate staff be interested in attending these?
Jupyter Book: https://jupyterbook.org. We also like it because it is based on Sphinx (something we are familiar with) and it defaults to Markdown (something easily portable).
Links mentioned in this Round Table:
If you are interested in contributing to a Round Table, please see these links: