During my internship, I worked with the Unidata THREDDS team. My goals this summer were to learn Java, improve my coding skills, and gain experience using the language in real-world applications. I began by converting existing unit tests for the netCDF-Java library, which is tightly linked to the THREDDS Data Server (TDS) code, to the JUnit Java testing framework. Once I had this practice with Java and a working development environment, I was able to start on my summer project.
With the rapid growth in the use of machine learning models in Earth science research, my project was an initiative toward providing new datasets intended for machine learning use. The Earth sciences have become substantially data-driven, with a variety of forecast models, large model simulations, and satellite missions producing an unprecedented volume of raw, unprocessed data. Machine learning models require significant preprocessing of that data, which involves cleaning, re-scaling, and splitting the dataset. The goal of re-scaling is to transform features onto a similar range, improving the performance and training stability of the model. Re-scaling is not always necessary, but it is essential when dealing with multiple variables on very different scales. My project focused on performing dataset preprocessing, in this case re-scaling, before the data is accessed by users targeting machine learning applications. After reviewing 13 papers from the AMS journal Artificial Intelligence for the Earth Systems (AIES), my Unidata mentor and I selected standardization and normalization (common types of re-scaling) for implementation as part of my project.
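To make the two re-scaling operations concrete, here is a minimal Java sketch of the underlying math (this is illustrative only, not the netCDF-Java implementation, which uses streaming statistics from the Apache Commons Mathematics Library): standardization maps each value to (x - mean) / stddev, while min-max normalization maps it to (x - min) / (max - min).

```java
import java.util.Arrays;

// Minimal sketch of the two re-scaling operations discussed above.
// Uses plain java.util.stream for self-containment; the actual project
// relied on the Apache Commons Mathematics Library for large data streams.
public class RescaleSketch {

    // Standardization (z-score): (x - mean) / stddev.
    static double[] standardize(double[] x) {
        double mean = Arrays.stream(x).average().orElse(0.0);
        double variance = Arrays.stream(x)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double std = Math.sqrt(variance);
        return Arrays.stream(x).map(v -> (v - mean) / std).toArray();
    }

    // Min-max normalization: (x - min) / (max - min), rescaling to [0, 1].
    static double[] normalize(double[] x) {
        double min = Arrays.stream(x).min().orElse(0.0);
        double max = Arrays.stream(x).max().orElse(1.0);
        return Arrays.stream(x).map(v -> (v - min) / (max - min)).toArray();
    }

    public static void main(String[] args) {
        double[] data = {2.0, 4.0, 6.0, 8.0};
        System.out.println(Arrays.toString(standardize(data)));
        System.out.println(Arrays.toString(normalize(data)));
    }
}
```

After normalization the values fall in [0, 1]; after standardization they have zero mean and unit standard deviation, which is what puts multiple variables on a comparable range.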
I decided to implement two functions in Java, modeled on scaler utilities such as MinMaxScaler from Scikit-learn, a Python machine learning library. Using an external Java library suited to large data streams (the Apache Commons Mathematics Library), I created the new re-scaling classes, which were then integrated into the netCDF-Java codebase. This process included creating constants/attributes in the Common Data Model class for standardization and normalization, adding Normalizer to the set of possible data enhancements, and applying the enhancements when "standardizer" or "normalizer" appeared as a netCDF variable attribute and the data was of floating-point type. Exposing the new classes in the TDS through the NetCDF Markup Language (NcML) allowed the creation of a virtual dataset that could be returned to the user without altering the original data or requiring additional disk space. By making these processed datasets available to TDS users, we reduce the amount of data preprocessing required on the user's end.
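As a rough illustration of this mechanism, an NcML wrapper that tags a variable for re-scaling might look something like the following sketch. Only the "standardizer" attribute name comes from the description above; the file location, variable name, and attribute value here are hypothetical:

```xml
<!-- Hypothetical NcML sketch: wrap an existing file as a virtual dataset
     and mark one variable for standardization. No original data is modified
     and no extra disk space is used; the enhancement is applied on access. -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="gfs_example.nc">
  <variable name="Temperature_surface">
    <attribute name="standardizer" value="true"/>
  </variable>
</netcdf>
```

The appeal of this design is that the TDS serves the re-scaled values as a virtual dataset, so users see preprocessed data while the underlying files stay untouched.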
The initial datasets chosen for preprocessing on the THREDDS test server were forecast (GFS) and satellite (GOES-18) data, due to their frequent use in the AIES papers reviewed. In addition to adding a mechanism for accessing the preprocessed datasets on the TDS test server, we included Jupyter notebooks for visualizing the preprocessed variables. I also created automated tests, both unit and integration tests, to verify that the code behaved as expected.
During the project, I also gained experience with GitHub by creating issues, opening pull requests, and participating in code review. I also evaluated the performance impact of applying the re-scaling. As next steps, the already reasonable performance results could be improved further, and more datasets relevant to users could be provided.
This summer offered me invaluable personal and professional development opportunities, including the Unidata Users Workshop, the Project Pythia Hackathon, and the professional development workshop series for UCAR interns. Together, these experiences helped me build confidence in becoming an open source contributor. Working on my project, along with the dedicated support of my mentor and the THREDDS team, has deepened my passion for scientific software development.