Re: [netcdf-java] OutOfMemoryError when opening an ensemble forecast dataset (many netcdf resources) via HTTP

  • To: Jesse Bickel - NOAA Affiliate <jesse.bickel@xxxxxxxx>
  • Subject: Re: [netcdf-java] OutOfMemoryError when opening an ensemble forecast dataset (many netcdf resources) via HTTP
  • From: John Caron <jcaron1129@xxxxxxxxx>
  • Date: Wed, 23 Oct 2019 09:54:27 -0600
Hi Jesse:

Are you doing an NcML Aggregation?
Do you know if your HTTP server (where the files live) supports range requests?
Can you send something to reproduce?


On Thu, Oct 17, 2019 at 4:39 PM Jesse Bickel - NOAA Affiliate via
netcdf-java <netcdf-java@xxxxxxxxxxxxxxxx> wrote:

> Hello,
> A nice feature of the java netcdf library is it allows us to use a netcdf
> resource in the same way regardless of locality. Our code can use the same
> library calls to open a netcdf resource whether it is on a local filesystem
> or on a web server.
> The java netcdf library makes use of the HTTP Range header in the
> HTTPRandomAccessFile class. This means the whole netcdf resource does not
> need to be read or downloaded prior to use. It seems the netcdf library
> handles the details quite well, requesting byte ranges similar to the way
> it would if the resource were on a local filesystem.
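
For readers unfamiliar with byte-range requests, here is a minimal sketch of the mechanism described above. It is not the code in HTTPRandomAccessFile; the class and method names (RangeRequestSketch, rangeHeader, readRange) are made up for illustration, and it assumes Java 11 or later and a server that answers range requests with 206 Partial Content.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical illustration only; not part of the netcdf-java library.
final class RangeRequestSketch {

  // Build a Range header value for `length` bytes starting at offset `start`.
  // HTTP byte ranges are inclusive at both ends.
  static String rangeHeader(long start, int length) {
    if (start < 0 || length <= 0) {
      throw new IllegalArgumentException("start must be >= 0 and length > 0");
    }
    return "bytes=" + start + "-" + (start + length - 1);
  }

  // Fetch a single byte range from `url` without downloading the whole resource.
  static byte[] readRange(String url, long start, int length) throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) URI.create(url).toURL().openConnection();
    try {
      conn.setRequestProperty("Range", rangeHeader(start, length));
      int status = conn.getResponseCode();
      if (status != HttpURLConnection.HTTP_PARTIAL) {
        throw new IOException("server did not honor the Range header: HTTP " + status);
      }
      try (InputStream in = conn.getInputStream()) {
        return in.readNBytes(length); // Java 11+
      }
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) {
    // First twenty thousand bytes of a resource.
    System.out.println(rangeHeader(0, 20 * 1000)); // prints "bytes=0-19999"
  }
}
```
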
> One downside of the approach is the amount of heap memory allocated by
> default for each netcdf resource, especially in the case of forecasts
> spanning multiple netcdf resources greater than ten million bytes each.
> When attempting to open a single ensemble forecast composed of multiple
> netcdf resources (in this case seven members times sixty-eight timesteps)
> prior to reading values from them, an OutOfMemoryError is encountered with
> the following stack trace:
> java.lang.OutOfMemoryError: Java heap space
>         at
> .RandomAccessFile.init(
>         at
> .RandomAccessFile.setBufferSize(
>         at
>         at
>         at ucar.nc2.NetcdfFile.getRaf(
>         at
>         at
>         at
>         at
> Two possibilities come to mind as workarounds. First, allocate a larger
> heap. How much larger? Perhaps ten million bytes times seven times
> sixty-eight, around 4.4GiB more. But it does not seem right to require an
> additional 4.4GiB of heap simply to open several resources, and the user
> might be on a 32-bit system. Second, perhaps we could find a way to
> progressively open, read, and close each resource. This might be possible,
> but seems clunky and incorrect. We should be able to open all the
> resources, then access what is needed across them, then close them. In this
> case, a forecast spans multiple resources and the goal is to read a single
> forecast.
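
The heap estimate in the first workaround can be checked with a few lines of arithmetic. This is only a sketch of the multiplication described above (buffer size times members times timesteps); the class and method names are invented for illustration, and it assumes every resource gets the full ten-million-byte buffer.

```java
import java.util.Locale;

// Hypothetical illustration only; reproduces the arithmetic from the message.
final class HeapEstimate {

  // Total bytes of buffer held if every resource is open at the same time.
  static long totalBufferBytes(long bufferBytes, int members, int timesteps) {
    return bufferBytes * members * timesteps;
  }

  public static void main(String[] args) {
    long total = totalBufferBytes(10_000_000L, 7, 68); // 476 resources
    double gigabytes = total / 1e9;                    // decimal GB
    double gibibytes = total / (double) (1L << 30);    // binary GiB
    // prints "4760000000 bytes = 4.76 GB = 4.43 GiB"
    System.out.printf(Locale.ROOT, "%d bytes = %.2f GB = %.2f GiB%n",
        total, gigabytes, gibibytes);
  }
}
```
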
> A third option is to consider code in HTTPRandomAccessFile near the
> OutOfMemoryError and the variables involved.
> public class HTTPRandomAccessFile extends
> {
>   public static final int defaultHTTPBufferSize = 20 * 1000; // 20K
>   public static final int maxHTTPBufferSize = 10 * 1000 * 1000; // 10 M
> ...
>     if (total_length > 0) {
>       // this means that we will read the file in one gulp then deal with it in memory
>       int useBuffer = (int) Math.min(total_length, maxHTTPBufferSize); // entire file size if possible
>       useBuffer = Math.max(useBuffer, defaultHTTPBufferSize); // minimum buffer
>       setBufferSize(useBuffer);
>     }
> The Math.min and Math.max calls appear to clamp the buffer size to the
> range from twenty thousand to ten million bytes, so a ten-million-byte
> buffer is allocated for each netcdf resource of ten million bytes or more.
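
The clamping can be isolated into a small standalone sketch. The two constants carry the values quoted above; the class name and the chooseBufferSize method are hypothetical wrappers added here so the logic can be run on its own, and are not part of the library.

```java
// Hypothetical illustration only; mirrors the quoted Math.min / Math.max logic.
final class BufferSizing {
  static final int DEFAULT_HTTP_BUFFER_SIZE = 20 * 1000;     // 20K, as quoted
  static final int MAX_HTTP_BUFFER_SIZE = 10 * 1000 * 1000;  // 10M, as quoted

  // Buffer size chosen for a resource of the given total length in bytes.
  static int chooseBufferSize(long totalLength) {
    // Cap at the maximum: whole file if it fits, otherwise ten million bytes.
    int useBuffer = (int) Math.min(totalLength, MAX_HTTP_BUFFER_SIZE);
    // Never go below the default minimum buffer.
    return Math.max(useBuffer, DEFAULT_HTTP_BUFFER_SIZE);
  }

  public static void main(String[] args) {
    System.out.println(chooseBufferSize(5_000L));       // prints 20000
    System.out.println(chooseBufferSize(3_000_000L));   // prints 3000000
    System.out.println(chooseBufferSize(10_000_000L));  // prints 10000000
    System.out.println(chooseBufferSize(250_000_000L)); // prints 10000000
  }
}
```

Any resource at or above the maximum ends up with the full ten-million-byte buffer, which is what multiplies across the members and timesteps.
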
> Experimentation shows that there are more requests made when this
> maxHTTPBufferSize is reduced, but the OutOfMemoryError is avoided.
> The version control history shows the code used to use only the twenty
> thousand byte value, not ten million.
> Is there any significance to the ten million byte buffer size?
> Would you be willing to make a new default of two hundred thousand or
> perhaps offer a Java System Property option to configure the value at JVM
> launch time? For example,
> It is preferable to use a build tool to fetch the ucar-built and
> ucar-tested cdm artifact to get the latest and greatest updates rather than
> maintain a fork.
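
One way such a launch-time option could be read is sketched below. The property name example.http.maxBufferSize is hypothetical (the name proposed in the original message is not preserved here), as are the class and method names; the library is not known to read any such property. Integer.getInteger is the standard JDK call for reading an integer-valued system property with a fallback.

```java
// Hypothetical illustration only; the property name is an assumption.
final class HttpBufferConfig {
  // Assumed property name, chosen for this sketch.
  static final String MAX_BUFFER_PROPERTY = "example.http.maxBufferSize";
  static final int DEFAULT_MAX_BUFFER = 10 * 1000 * 1000; // current value, as quoted

  // Read the maximum buffer size from a system property, falling back to the
  // default when the property is unset, unparsable, or not positive.
  static int maxHttpBufferSize() {
    int configured = Integer.getInteger(MAX_BUFFER_PROPERTY, DEFAULT_MAX_BUFFER);
    return configured > 0 ? configured : DEFAULT_MAX_BUFFER;
  }

  public static void main(String[] args) {
    System.out.println("max HTTP buffer size: " + maxHttpBufferSize());
  }
}
```

Under these assumptions the value would be set at launch with something like java -Dexample.http.maxBufferSize=200000 HttpBufferConfig.
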
> I have not experimented with the setting with any datasets other than this
> narrow use case, so I also wonder about the impact on other uses. All I can
> see from experimentation is that a trade-off is made between
> request/response overhead on the one hand (higher when set lower) and data
> volume on the other hand (higher when set higher).
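
The trade-off can be put into rough numbers. The sketch below assumes a simplified model that the original message does not state: a resource is read front to back, each buffer fill costs exactly one HTTP request, and every open resource holds one buffer. Real access patterns (header reads, seeks, partial reads) will differ, and the class and method names are invented for illustration.

```java
// Hypothetical illustration only; a simplified model of the trade-off.
final class BufferTradeOff {

  // Requests needed to read a whole file sequentially, one request per buffer fill.
  static long requestsForFullRead(long fileLength, int bufferSize) {
    if (fileLength < 0 || bufferSize <= 0) {
      throw new IllegalArgumentException("fileLength must be >= 0 and bufferSize > 0");
    }
    return (fileLength + bufferSize - 1) / bufferSize; // ceiling division
  }

  // Heap held by buffers when `openFiles` resources are open at once.
  static long heapForOpenFiles(int openFiles, int bufferSize) {
    return (long) openFiles * bufferSize;
  }

  public static void main(String[] args) {
    long fileLength = 10_000_000L; // one resource of ten million bytes
    int openFiles = 7 * 68;        // seven members times sixty-eight timesteps
    for (int bufferSize : new int[] {20_000, 200_000, 10_000_000}) {
      System.out.println("buffer=" + bufferSize
          + " requests/file=" + requestsForFullRead(fileLength, bufferSize)
          + " heap bytes=" + heapForOpenFiles(openFiles, bufferSize));
    }
  }
}
```
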
> The trace above is with cdm-5.1.0.jar, from the ucar artifact repository
> (fetched with gradle), with sha256sum
> d211d2b040aa1d63bc3a6898bb27f55fb116f743dee4572b7f1228e9d4cf37f1.
> Thank you for your consideration,
> Jesse Bickel
> Contractor, ERT, Inc.
> Federal Affiliation: NWC/OWP/NOAA/DOC