Hello,

A nice feature of the java netcdf library is that it allows us to use a netcdf resource in the same way regardless of locality. Our code can use the same library calls to open a netcdf resource whether it is on a local filesystem or on a web server. The library makes use of the HTTP Range header in the HTTPRandomAccessFile class, which means the whole netcdf resource does not need to be read or downloaded prior to use. It seems the library handles the details quite well, requesting byte ranges much as it would if the resource were on a local filesystem.

One downside of the approach is the amount of heap memory allocated by default for each netcdf resource, especially for forecasts spanning multiple netcdf resources greater than ten million bytes each. When attempting to open a single ensemble forecast composed of multiple netcdf resources (in this case seven members times sixty-eight timesteps) prior to reading values from them, an OutOfMemoryError is encountered with the following stack trace:

java.lang.OutOfMemoryError: Java heap space
    at ucar.unidata.io.RandomAccessFile.init(RandomAccessFile.java:376)
    at ucar.unidata.io.RandomAccessFile.setBufferSize(RandomAccessFile.java:387)
    at ucar.unidata.io.http.HTTPRandomAccessFile.<init>(HTTPRandomAccessFile.java:98)
    at ucar.unidata.io.http.HTTPRandomAccessFile.<init>(HTTPRandomAccessFile.java:40)
    at ucar.nc2.NetcdfFile.getRaf(NetcdfFile.java:615)
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:506)
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:473)
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:458)
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:446)

Two workarounds come to mind.

First, allocate a larger heap. How much larger? Perhaps ten million bytes times seven members times sixty-eight timesteps, around 4.5 GiB more. But it does not seem right to require an additional 4.5 GiB of heap simply to open several resources, and suppose the user is on a 32-bit system.

Second, perhaps we could find a way to progressively open, read, and close each resource. This might be possible, but it seems clunky and incorrect. We should be able to open all the resources, access what is needed across them, and then close them. In this case, a forecast spans multiple resources and the goal is to read a single forecast.

A third option is to look at the code in HTTPRandomAccessFile near the OutOfMemoryError and the variables involved:

public class HTTPRandomAccessFile extends ucar.unidata.io.RandomAccessFile {
  public static final int defaultHTTPBufferSize = 20 * 1000; // 20K
  public static final int maxHTTPBufferSize = 10 * 1000 * 1000; // 10 M
  ...
  if (total_length > 0) {
    // this means that we will read the file in one gulp then deal with it in memory
    int useBuffer = (int) Math.min(total_length, maxHTTPBufferSize); // entire file size if possible
    useBuffer = Math.max(useBuffer, defaultHTTPBufferSize); // minimum buffer
    setBufferSize(useBuffer);
  }

The effect of the Math.min and Math.max calls appears to be that a ten-million-byte buffer is allocated for each netcdf resource whose length is at least ten million bytes. Experimentation shows that more requests are made when maxHTTPBufferSize is reduced, but the OutOfMemoryError is avoided. The version control history shows that only the twenty-thousand-byte value used to be used, not ten million. Is there any significance to the ten million byte buffer size?
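For context, the access pattern looks roughly like the following sketch. The URLs, class name, and variable name are illustrative only, not our actual application code; the point is simply that every open of a resource of at least ten million bytes holds a ten-million-byte buffer before a single value has been read.

import java.util.ArrayList;
import java.util.List;

import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class RemoteEnsembleSketch {
  public static void main(String[] args) throws Exception {
    // Illustrative locations only: one resource per member per timestep,
    // served over HTTP rather than read from a local filesystem.
    List<String> locations = new ArrayList<>();
    for (int member = 1; member <= 7; member++) {
      for (int timestep = 1; timestep <= 68; timestep++) {
        locations.add("https://example.gov/forecast/member" + member
            + "/timestep" + timestep + ".nc");
      }
    }

    List<NetcdfFile> resources = new ArrayList<>();
    try {
      // Open everything up front; each open of a resource of at least
      // ten million bytes triggers a ten-million-byte buffer in
      // HTTPRandomAccessFile via setBufferSize.
      for (String location : locations) {
        resources.add(NetcdfFile.open(location)); // same call as for a local path
      }
      // Then read what the single forecast needs across the open resources.
      for (NetcdfFile nc : resources) {
        Variable v = nc.findVariable("streamflow"); // illustrative variable name
        if (v != null) {
          Array values = v.read();
          // ... combine values into the forecast ...
        }
      }
    } finally {
      for (NetcdfFile nc : resources) {
        nc.close();
      }
    }
  }
}

With seven members times sixty-eight timesteps, that is 476 open resources and roughly 4.5 GiB held in buffers before any reads happen.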
Would you be willing to make a new default of two hundred thousand, or perhaps offer a Java System Property option to configure the value at JVM launch time? For example, -Ducar.unidata.io.http.maxHTTPBufferSize=200000 (a rough sketch of what I have in mind follows after my signature).

It is preferable to use a build tool to fetch the ucar-built and ucar-tested cdm artifact and get the latest and greatest updates rather than maintain a fork. I have not experimented with the setting on any datasets other than this narrow use case, so I also wonder about the impact on other uses. All I can see from experimentation is that there is a trade-off between request/response overhead (higher when the value is set lower) and data volume (higher when it is set higher).

The trace above is with cdm-5.1.0.jar from the ucar artifact repository (fetched with gradle), with sha256sum d211d2b040aa1d63bc3a6898bb27f55fb116f743dee4572b7f1228e9d4cf37f1.

Thank you for your consideration,

Jesse Bickel
Contractor, ERT, Inc.
Federal Affiliation: NWC/OWP/NOAA/DOC
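P.S. To make the System Property suggestion concrete, here is a rough, self-contained sketch, not existing cdm code, of reading such a value with a fallback to the current ten-million-byte default. The property name is only the one suggested above.

public class BufferSizePropertySketch {
  // Hypothetical property name, matching the -D example above.
  private static final String PROPERTY_NAME = "ucar.unidata.io.http.maxHTTPBufferSize";

  // Integer.getInteger parses the named system property as an int and returns
  // the supplied default when the property is absent or not a valid integer.
  static final int maxHTTPBufferSize =
      Integer.getInteger(PROPERTY_NAME, 10 * 1000 * 1000);

  public static void main(String[] args) {
    // Launched with -Ducar.unidata.io.http.maxHTTPBufferSize=200000 this
    // prints 200000; launched without the property it prints 10000000.
    System.out.println("maxHTTPBufferSize = " + maxHTTPBufferSize);
  }
}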