DiskCache2 Issue in the netCDF-Java Library

The Unidata THREDDS Development Team released updated versions of the THREDDS Data Server (TDS) and netCDF-Java/Common Data Model (CDM) library on June 17, 2020. In addition to feature enhancements, these releases contain a variety of updates to third-party libraries, including security updates. They also address a problem in previous versions that could corrupt the data returned by some NetcdfSubsetService (NCSS) requests. The circumstances under which the problem occurs are very specific (and rare), but because the possibility of data corruption exists, the development team strongly recommends these upgrades to anyone using netCDF-Java/CDM or TDS. TDS administrators who are not able to upgrade immediately should disable the NetcdfSubsetService until they can.

Note that there are releases for both the 4.6.x branch of the TDS, which includes an associated version of the netCDF-Java library, and for version 5.x, for which the TDS and netCDF-Java code bases are separate. The 4.6.x release is TDS version 4.6.15. The 5.x releases are version 5.3.3 of the netCDF-Java library (supported release) and version 5.0.0-beta8 of the TDS (beta-test release).

About the Issue

The method in the netCDF-Java ucar.nc2.util.DiskCache2 class that generates unique file names could produce identical names if called in quick succession. While the method was thread safe (synchronized), it was not timing safe. The TDS uses this cache as a temporary file storage directory to service NCSS requests when the response format is netCDF or netCDF-4. When multiple NCSS requests were made in a sustained and rapid manner, the service could return a corrupt netCDF file, or possibly a netCDF file with incorrect metadata and/or data.

Background

The TDS is a Java-based web application that provides access to data and metadata through a variety of services, heavily leveraging the netCDF-Java library. One of the services, NCSS, is capable of returning spatial-, temporal-, and/or variable-level subsets of data in a variety of formats. NCSS has been part of the TDS in one form or another since 2007. Two of the return formats supported by NCSS (netCDF and netCDF-4) require a temporary file to be written to disk prior to the return, while the remaining formats (csv, xml, etc.) stream results of the subset operation directly back to the client.

The temporary files used when returning data in netCDF or netCDF-4 format receive their names from a netCDF-Java DiskCache2 cache. DiskCache2 creates temporary file names using a supplied prefix and suffix, as well as the results from a random number generator. DiskCache2 has used this method for creating temporary file names since 2010, when TDS version 4.2 was released.

NCSS subsets come in three types: grid, grid as point, and point; each type uses a different prefix when requesting a temporary file name from the NCSS cache. The method for creating temporary file names is synchronized so that only one thread is allowed to use the method at a time, and as such it is considered thread safe. However, prior to the TDS 4.6.15 and netCDF-Java 5.3.3 releases, it was possible for two or more threads to be assigned the same temporary file name if the TDS received two NCSS requests under the following conditions:

  1. the NCSS requests were of the same type (both grid subsets, both grid-as-point subsets, or both subsets of point collections),
  2. the requests were asking for netCDF or netCDF-4 files in return, and
  3. the requests were received nearly simultaneously, and their respective threads were able to call and enter the method requesting a temporary file name within the same millisecond.

So while the temporary file-naming method of DiskCache2 was thread safe, it could not be considered timing safe. If the conditions above were met, it was possible that two threads servicing NCSS requests could be assigned the same temporary file name, leading to a race condition when writing the result to disk before returning the file to fulfill the request. If the amount of work needed to find the subset of data to fulfill the requests was well aligned, computationally speaking (roughly the same amount of cycles used, same I/O timing, etc.), it was possible for the server to respond to one or more of the requests with a netCDF file containing incorrect data.
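
The collision described above can be illustrated with a short Java sketch. It is not the DiskCache2 source: it assumes that the pre-fix method built a new random number generator seeded from the millisecond clock on each call, which is one way the same-millisecond condition could yield identical names. The class name, method names, and the "ncss-grid-" prefix are hypothetical. The second method shows one common timing-safe alternative, letting the file system reserve the name atomically, which is not necessarily the fix that shipped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;

// Illustrative sketch only; names and seeding strategy are assumptions.
class TempNameSketch {

  // Assumed pre-fix pattern: synchronized, but the generator is re-seeded
  // from the clock, so two calls in the same millisecond share a seed.
  static synchronized String clockSeededName(String prefix, String suffix, long clockMillis) {
    Random random = new Random(clockMillis);
    return prefix + random.nextInt() + suffix;
  }

  // Timing-safe alternative: Files.createTempFile creates the file
  // atomically, so an existing name is never handed out a second time.
  static Path reservedFile(Path dir, String prefix, String suffix) {
    try {
      return Files.createTempFile(dir, prefix, suffix);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Two "requests" of the same type arriving in the same millisecond.
    long sameMillisecond = System.currentTimeMillis();
    String first = clockSeededName("ncss-grid-", ".nc", sameMillisecond);
    String second = clockSeededName("ncss-grid-", ".nc", sameMillisecond);
    System.out.println("clock-seeded names collide: " + first.equals(second));

    Path dir = Paths.get(System.getProperty("java.io.tmpdir"));
    Path a = reservedFile(dir, "ncss-grid-", ".nc");
    Path b = reservedFile(dir, "ncss-grid-", ".nc");
    System.out.println("reserved names collide: " + a.equals(b));

    // Clean up the files created for the demonstration.
    a.toFile().deleteOnExit();
    b.toFile().deleteOnExit();
  }
}
```

The synchronized keyword only serializes the two calls; it does nothing to make their inputs differ, which is why thread safety alone did not prevent the duplicate name.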

Note that this issue could be exacerbated if multiple load-balanced TDS instances share cache directories. TDS cache directories should not be shared between instances, and doing so could lead to silent and unpredictable errors. Such sharing is not explicitly disallowed, however, and unintentional use of a shared cache directory can occur when using a common configuration file for multiple TDS instances. Administrators of TDS installations that fit the above description are encouraged to check the cache configuration carefully to ensure that each TDS instance is using a unique cache directory.
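
As a rough aid for the check suggested above, the following Java sketch flags TDS instances whose configured cache directories resolve to the same path. It is a hypothetical helper, not part of the TDS or netCDF-Java: the class name, method name, instance labels, and example paths are all assumptions, and the directory strings would have to be collected by hand from each instance's configuration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the TDS or netCDF-Java.
class CacheDirCheck {

  // Reports every instance whose cache directory was already claimed by an
  // earlier instance in the map. Paths are compared after normalization;
  // symbolic links are not resolved.
  static List<String> sharedCacheDirs(Map<String, String> cacheDirByInstance) {
    Map<Path, String> ownerByDir = new HashMap<>();
    List<String> conflicts = new ArrayList<>();
    for (Map.Entry<String, String> entry : cacheDirByInstance.entrySet()) {
      Path dir = Paths.get(entry.getValue()).toAbsolutePath().normalize();
      String earlierOwner = ownerByDir.putIfAbsent(dir, entry.getKey());
      if (earlierOwner != null) {
        conflicts.add(entry.getKey() + " shares " + dir + " with " + earlierOwner);
      }
    }
    return conflicts;
  }

  public static void main(String[] args) {
    // Example values only; substitute the directories from each instance.
    Map<String, String> dirs = new LinkedHashMap<>();
    dirs.put("tds-a", "/data/tds/cache/ncss");
    dirs.put("tds-b", "/data/tds/cache/./ncss");
    dirs.put("tds-c", "/data/tds-c/cache/ncss");
    sharedCacheDirs(dirs).forEach(System.out::println);
  }
}
```

Because the comparison is purely textual after normalization, two directories that reach the same storage through different mount points or symbolic links would not be detected.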

Impact

Three modes of failure have been identified. The first and most common is that the server returns a corrupt netCDF file. The second most common is that the server returns a valid netCDF file that contains incorrect metadata but correct data values. This is because the time at which the metadata is written to a netCDF file is the time when the file can most easily change in a non-corrupting way. The least frequent failure mode, and easily the worst case, can happen when the subset requests are nearly identical, differing only by time or location. In this case, the actual data values contained within the netCDF file could be incorrect.

Getting this Release

Users of the current supported version of the TDS are encouraged to upgrade to version 4.6.15, which includes an appropriate version of the netCDF-Java library. The v4.6.x maintenance line will continue to reside at https://github.com/Unidata/thredds. The Unidata managed TDS Docker container for this release can be found at https://github.com/Unidata/thredds-docker.

Users of the netCDF-Java library are encouraged to upgrade to version 5.3.3 of netCDF-Java, located at https://github.com/Unidata/netcdf-java.

Users of the beta-test version of the TDS are encouraged to upgrade to version 5.0.0-beta8, available at https://www.unidata.ucar.edu/downloads/tds/ (Docker container at https://github.com/Unidata/thredds-docker), alongside the upgrade to netCDF-Java version 5.3.3.

Special Thanks

The netCDF-Java and TDS development team would like to thank the teams at NASA's Land Processes Distributed Active Archive Center (LP DAAC), located at the U.S. Department of the Interior, U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD, and at the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) for Biogeochemical Dynamics for their efforts in identifying the issue and providing extensive testing information.

The team from the LP DAAC (DAAC Manager Chris Torbert, lead scientist Tom Maiersperger, and Cole Krehdiel, Cody Hendrix, Robert Quenzer, and Aafaque Aafaque) originally discovered the issue while calling the ORNL DAAC TDS to access the Daymet data product, produced by the ORNL DAAC (http://daymet.ornl.gov/), from their Application for Extracting and Exploring Analysis Ready Samples (AppEEARS) platform (https://lpdaac.usgs.gov/tools/appeears/). They provided initial testing scripts to the ORNL DAAC to demonstrate the problem. From there, the team at the ORNL DAAC (DAAC Manager Bruce Wilson, lead scientist Yaxing Wei, Chris Lindsley, and Ketan Patel) worked together to investigate the issue, expand upon the work of the LP DAAC team, and compile a report that was submitted to Unidata on June 12, 2020. These testing scripts and reports greatly facilitated the discovery of the underlying bug and enabled Unidata to issue a fix and release shortly after receiving word of the issue.

Both the ORNL DAAC and the LP DAAC are part of the NASA Earth Observing System Data and Information System (EOSDIS), funded by NASA. Unidata has benefited from a long-standing, collaborative relationship with NASA, and has NASA representation on its Strategic Advisory Committee. Chris Lynnes, an EOSDIS architect, currently serves in this role.

Unidata Developer's Blog
A weblog about software development by Unidata developers*