Reproducible science discussed at the Jupyter for Science workshop
Jupyter is the “Google Docs” of data science: it provides the same kind of easy-to-use, shared ecosystem, but for interactive data exploration, modelling, and analysis. However, some work still needs to be done to make Jupyter the best interactive and practical tool for big science. Doing this right will take a community: new collaborations between core Jupyter developers, engineers from high-performance computing (HPC) centres, staff from large-scale experimental and observational data (EOD) facilities, users, and other stakeholders. Many facilities have figured out how to deploy, manage, and customize Jupyter, but have done so while focused on their own requirements and capabilities. Others are just taking their first steps and want to avoid reinventing the wheel. With some initial critical mass, the different actors can begin contributing what they have learned separately into a shared body of knowledge, patterns, tools, and best practices.

To this end, a Jupyter Community Workshop was held on 11-13 June at the National Energy Research Scientific Computing Center (NERSC) and the Berkeley Institute for Data Science (BIDS), bringing together about 40 members of this community to start distilling that shared knowledge. Over three days of talks and breakout sessions, participants addressed pain points and best practices in Jupyter deployment, infrastructure, and user support; securing Jupyter in multi-tenant environments; sharing notebooks; HPC/EOD-focused Jupyter extensions; and strategies for communicating with stakeholders.

Michael Milligan from the Minnesota Supercomputing Institute set the tone for the workshop with his keynote, “Jupyter is a One-Stop Shop for Interactive HPC Services”. Michael is the creator of BatchSpawner and WrapSpawner, JupyterHub spawners that let HPC users run notebooks on compute nodes managed by a variety of batch queue systems (a minimal configuration sketch is shown below). Contributors to both packages met in an afternoon-long breakout to build consensus around some technical issues, start managing development and support collaboratively, and gel as a team.

Securing Jupyter is a huge topic. Thomas Mendoza from Lawrence Livermore National Laboratory talked about his work to enable end-to-end SSL in JupyterHub and about best practices for securing Jupyter. Outcomes from two breakouts on security include a plan to document security best practices more prominently and a future workshop focused specifically on security in Jupyter.

Speakers from Lawrence Livermore and Oak Ridge National Laboratories and the European Space Agency showed a variety of beautiful JupyterLab extensions, integrations, and plug-ins for climate science, complex physical simulations, astronomical images and catalogs, and atmospheric monitoring. Facilities everywhere are finding ways to adapt Jupyter to the specific needs of their scientists.

PaNOSC (https://panosc.eu) and its goals were also presented to the community. In particular, Robert Rosca from the European XFEL gave the presentation “Jupyter for reproducible science at photon and neutron facilities”. After an introduction to the project, Rosca presented a few use cases related to the reproducibility and re-usability of published results and to performing new analyses on existing data sets, and showcased some of the challenges the project will have to tackle during its implementation. The session ended with a stimulating discussion about what users at different facilities do with Jupyter.
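To make the spawner and security discussions above concrete, here is a minimal sketch of a jupyterhub_config.py, assuming a Slurm-managed cluster; the partition name and resource values are placeholders, not the configuration of any facility mentioned in this article.

# jupyterhub_config.py -- hypothetical sketch; 'c' is supplied by JupyterHub
# when it loads this file.

# BatchSpawner provides spawners for several schedulers; SlurmSpawner submits
# each user's notebook server as a Slurm batch job on a compute node.
c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
c.SlurmSpawner.req_partition = 'interactive'   # placeholder partition name
c.SlurmSpawner.req_runtime = '02:00:00'        # placeholder wall-clock limit
c.SlurmSpawner.req_memory = '4G'               # placeholder memory request

# WrapSpawner's ProfilesSpawner could be layered on top to offer users a menu
# of predefined spawner configurations (different queues, node types, etc.).

# End-to-end SSL between the hub, proxy, and single-user servers, one of the
# hardening measures discussed in the security sessions.
c.JupyterHub.internal_ssl = True

With a configuration along these lines, JupyterHub runs on a service or login node while each user's notebook server is queued and executed on the compute nodes, so interactive sessions inherit the centre's existing allocation and accounting policies.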
One of the main conclusions was that there should be a major focus on real-time collaboration, data management and exploration within notebooks, improving the reproducibility of notebooks, and (related to that) containerised notebook execution; a small sketch of the non-interactive execution idea follows the links below.

Download the presentation by Robert Rosca here >> http://bit.ly/2lGSb6J
Have a look at the breakout session brainstorming here >> http://bit.ly/2m2R3L7
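As one possible illustration of the reproducibility point, notebooks can be executed non-interactively with explicit parameters, so that a published analysis can be re-run against existing or new data sets. The sketch below uses papermill, one tool in the Jupyter ecosystem for this, with hypothetical notebook names and parameters; containerised execution, the related conclusion, would typically add a tool such as repo2docker to rebuild the software environment alongside the notebook.

# reproduce_analysis.py -- hypothetical sketch re-executing a published notebook
import papermill as pm

# Run 'analysis.ipynb' from top to bottom, injecting parameters, and save the
# fully executed copy (with outputs) as a new notebook for the record.
pm.execute_notebook(
    'analysis.ipynb',            # placeholder: the published notebook
    'analysis_rerun.ipynb',      # placeholder: executed output notebook
    parameters={'dataset_path': '/data/run_0042.h5'},  # placeholder input
)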
Keywords
EOSC, Data Analysis, Jupyter, Reproducible science