Re: [netcdf-java] Errors reading certain NetCDF4 data

  • To: Ryan May <rmay@xxxxxxxx>
  • Subject: Re: [netcdf-java] Errors reading certain NetCDF4 data
  • From: Christian Ward-Garrison <cwardgar@xxxxxxxx>
  • Date: Thu, 19 Feb 2015 15:34:17 -0700

Hi Antonio,

Our point about data layout stands, but if you still want to see what
performance benefits you can get by rechunking, I think you should use a
different shape. In his follow-up blog [1], Russ Rew provides a Python
function that calculates a good 3D chunk shape for your read pattern:

# arguments: variable shape, bytes per value, target chunk size in bytes
print chunk_shape_3D([5088,103,122], 4, 4096)
[21, 6, 8]

Maybe give that a try instead?
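
If it helps, nccopy can apply that chunk shape when copying the file (the
file names below are just placeholders; I haven't run this against your
data):

nccopy -c time/21,latitude/6,longitude/8 2014.nc 2014_rechunked.nc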

Cheers,
Christian

[1]
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes
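
P.S. If you want to time the single-point, all-times read pattern we've
been discussing, here is a minimal netCDF4-python sketch (the file and
variable names come from your metadata; the grid indices are arbitrary):

import time
from netCDF4 import Dataset

nc = Dataset("2014_ch.nc")
u10m = nc.variables["u10m"]      # shape (time, latitude, longitude)
t0 = time.time()
series = u10m[:, 50, 60]         # all 5088 times at one grid point
print "read %d values in %.3f s" % (len(series), time.time() - t0)
nc.close()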

On Thu, Feb 19, 2015 at 12:45 PM, Ryan May <rmay@xxxxxxxx> wrote:

> Antonio,
>
> Sorry, I misspoke--time *should* be the last dimension, since for
> C-ordering, the last dimension will vary the fastest (i.e. items along this
> dimension will be sequential in memory). (I then got that crossed-up with
> your chunking description, which you're correct about.)
>
> It's possible for chunking to make up some of the performance difference,
> but you're never going to be as fast as just re-ordering the data. Russ
> Rew's example quoted times with chunking going from 200 seconds to 1.4
> seconds, but his example had about 20x as many time steps as you do.
> Given that you're quoting times of less than 1 second, I wonder if you're
> just not dominated by seek time. Certainly, since you're on an SSD, the
> penalties for non-sequential access are much smaller than for spinning disks.
>
> Ryan
>
> On Thu, Feb 19, 2015 at 11:48 AM, Antonio Rodriges <antonio.rrz@xxxxxxxxx>
> wrote:
>
>> Ryan,
>>
>> I do have time as my first dimension (Christian suggested making time
>> the last dimension) and thought that after rechunking I would get
>> something like this:
>>
>> 4x4 (a lat-lon 2D tile stored contiguously on disk), 4x4, 4x4, 4x4, ......, 4x4
>> <<---------------------------- the number of rasters is 512 ---------------------------->>
>>
>> so the distance between different dates is not 8 kb but should be
>> only 4 x 4 x sizeof(float) = 64 bytes for the expected layout.
>>
>> Here is the metadata (it does not show the chunk sizes -- is it possible
>> to see those?):
>>
>> netcdf
>> file:/d:/RS_DATA/worker/merra_ts/tavg1_2d_slv_Nx/wind_australia_chunked/u10m/chunked/
>> 2014_ch.nc
>> {
>>  dimensions:
>>    latitude = 103;
>>    longitude = 122;
>>    time = UNLIMITED;   // (5088 currently)
>>  variables:
>>    double latitude(latitude=103);
>>      :_Netcdf4Dimid = 0; // int
>>      :units = "degrees_north";
>>      :long_name = "Latitude";
>>    double longitude(longitude=122);
>>      :_Netcdf4Dimid = 1; // int
>>      :units = "degrees_east";
>>      :long_name = "Longitude";
>>    double time(time=5088);
>>      :_Netcdf4Dimid = 2; // int
>>      :units = "hours since 2014-1-1 0";
>>    float u10m(time=5088, latitude=103, longitude=122);
>>      :comments = "Unknown1 variable comment";
>>      :long_name = "Eastward wind at 10 m above displacement height";
>>      :units = "m s-1";
>>      :grid_name = "grid-1";
>>      :grid_type = "linear";
>>      :level_description = "Earth surface";
>>      :time_statistic = "instantaneous";
>>      :missing_value = 9.9999999E14f; // float
>>
>>  :Conventions = "COARDS";
>>  :calendar = "standard";
>>  :comments = "file created by grads using lats4d available from
>> http://dao.gsfc.nasa.gov/software/grads/lats4d/";;
>>  :model = "geos/das";
>>  :center = "gsfc";
>>  :history = "Mon Dec 01 20:20:48 2014:
>>
>> D:\\DATA\\worker\\merra_ts\\tavg1_2d_slv_Nx\\wind_australia\\u10m\\ncks.exe
>> -4 --cnk_dmn lat,4 --cnk_dmn lon,4 --cnk_dmn time,512 2014.nc
>> 2014_ch.nc\\nWed Oct 15 20:26:23 2014: ncrcat -v u10m -o 2014.nc";
>>  :nco_openmp_thread_number = 1; // int
>>  :nco_input_file_number = 212; // int
>>  :NCO = "20141201";
>> }
>>
>> 2015-02-19 21:24 GMT+03:00 Ryan May <rmay@xxxxxxxx>:
>> > Antonio,
>> >
>> > Even with that chunk size, the number of bytes between consecutive
>> > points in time is 512 x 4 x sizeof(float), which is 8 kb. You may get a
>> > few points closer together, but they're still not close together. Any
>> > read-ahead function of the disk will be throwing away 99% of the data
>> > if all you want is all the times for a single point.
>> >
>> > If your predominant access pattern is all times for a single point,
>> > your best throughput will be achieved by making sure that those points
>> > are consecutive on disk, which means that you should have time be the
>> > first dimension, not the last. Anything else you do will be papering
>> > over the core problem.
>> >
>> > Ryan
>> >
>> > On Thu, Feb 19, 2015 at 10:37 AM, Antonio Rodriges <antonio.rrz@xxxxxxxxx>
>> > wrote:
>> >>
>> >> Christian,
>> >>
>> >> According to Russ Rew,
>> >>
>> >> http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
>> >>
>> >> chunking should help for my access pattern.
>> >>
>> >> After rechunking I expected to have chunks of size 512x4x4, where
>> >> values for a single point at different times should be stored very
>> >> close together on disk.
>> >
>> >
>> >
>> >
>> > --
>> > Ryan May
>> > Software Engineer
>> > UCAR/Unidata
>> > Boulder, CO
>>
>
>
>
> --
> Ryan May
> Software Engineer
> UCAR/Unidata
> Boulder, CO
>