Due to the COVID-19 pandemic, Unidata's 2021 summer interns did not travel to Boulder to work on their projects in person. Instead, they interacted with Unidata developers through Slack, Zoom, and other electronic means.
I came into this summer internship with a goal of working on the Network Common Data Form (netCDF) libraries. NetCDF is a combination of software libraries and APIs describing a data model for scientific multidimensional arrays. I planned to improve the online user guide, write tutorial code, and learn about storage and efficiency.
Before this project, I had only used netCDF by calling high-level functions to read and write data in MATLAB, which uses functionality from the netCDF-C library. The netCDF data model is a standard across languages, with programming interfaces in C, Java, Fortran, Python, MATLAB, R, and more. For the majority of this summer, I worked closely with the netCDF-Java library, updating and expanding the online user guide.
I maintained the netCDF-Java documentation by updating tutorial code, testing code snippets, and modernizing tutorial text to improve user understanding. I started with the documentation itself: replacing raw HTML with Markdown, changing formatting, linking to relevant sites, updating UML diagrams, and including updated screenshots. I then moved on to updating and rewriting the tutorial code in Java. I created a tutorial class for each page, with every code snippet contained in a method. Viewing the code snippets inside IntelliJ, I was able to fix deprecations and update the code after some major changes were made to the structure of the netCDF-Java library. I then used netCDF-Java’s Jekyll plugin to insert the code snippets into the rendered HTML page. Finally, I created test classes to confirm the code was running properly. Moving the code snippets out of the Markdown files and into Java classes ensures that when future changes are made, errors in the user guide will not go unnoticed. See one of my pull requests for user guide updates.
After improving the user guide documentation, I embarked on the second focus of my internship: performance testing in Python. Because of my interest in data storage and efficiency, my mentors suggested that I compare data formats, including HDF5 and Zarr. HDF5 is the file format used by netCDF-4; it adds compression and chunking to the netCDF data model. I switched from working in Java to Python so that I could compare read times with Zarr, a Python-based data storage format. I compared reads of netCDF-3, netCDF-4 Classic, netCDF-4, Zarr, and Zarr read through Xarray. The completed performance testing showed that read times increased as chunk size decreased, though at a different rate for each format. When chunk size was large, reading from a Zarr directory store was faster; as chunk size decreased, however, netCDF-4 reads became much faster by comparison. I learned that the difference in read times comes from how each format stores data. A netCDF-4 file stores all data in one .nc file; consequently, more operations are needed to find the appropriate data, but only one file open is required. A Zarr directory store saves chunked data as many subdirectories and files, so the more chunks there are, the more individual files the store contains and the more files must be opened to read the data. You can see the notebooks I created, testing data, and full results in my GitHub repository.
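The link between chunk size and file count can be sketched in a few lines of plain Python. This is an illustration only, not code from my notebooks: the array shape, the chunk shapes, and the function names are all hypothetical, and the key format assumes the dot-separated chunk keys that Zarr version 2 uses by default.

```python
from itertools import product
from math import prod


def chunk_grid(shape, chunks):
    """Return how many chunks are needed along each dimension.

    Uses ceiling division, so a partial chunk at the edge still counts.
    """
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of dimensions")
    return tuple(-(-size // chunk) for size, chunk in zip(shape, chunks))


def chunk_count(shape, chunks):
    """Total number of chunks, i.e. roughly one file per chunk in a directory store."""
    return prod(chunk_grid(shape, chunks))


def chunk_keys(shape, chunks):
    """List dot-separated chunk keys such as '0.0.1' (assumed Zarr v2 default naming)."""
    grid = chunk_grid(shape, chunks)
    return [".".join(str(i) for i in index)
            for index in product(*(range(n) for n in grid))]


if __name__ == "__main__":
    # Hypothetical (time, lat, lon) variable; not the data used in my tests.
    shape = (365, 720, 1440)
    for chunks in [(365, 720, 1440), (73, 180, 360), (5, 90, 180), (1, 45, 90)]:
        print(f"chunk shape {chunks}: {chunk_count(shape, chunks)} chunk files")
```

Running the script shows the chunk count growing quickly as the chunk shape shrinks. In a directory store each of those chunks is a separate file to open, whereas a netCDF-4 file keeps every chunk inside a single .nc file.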
My internship with Unidata allowed me to explore my own interests with netCDF software while sharing findings with the public. I was able to contribute to the open source community for the first time, conduct testing of my own, and gain professional development skills through my mentors and UCAR/Unidata’s community. I am very grateful for this summer opportunity and all the individuals who made this remote collaboration possible.