
[Support #CUV-251255]: Nccopy extremely slow / hangs



Mark,

> Your suggestion for changing the size of the time-chunking proved to be 
> extremely sensible - doing the whole thing (including removing the record 
> dimension and compression) using a chunk size of 
> time/566,longitude/24,latitude/25 took 11 minutes on my 8GB machine:
> 
> ./nccopy -u -k3 -m 1G -h 4G -e 65000 -c time/566,longitude/24,latitude/25 
> combined_snapshots.nc temporal_read_optimised.nc
> 600.81user 58.12system 11:04.90elapsed 99%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 1201712inputs+1235864outputs (252major+25217140minor)pagefaults 0swaps
> 
> I dare not declare victory, but at least I have something now that works 
> well....

Great, I'm glad that worked.

> I was thinking that it could be useful to provide some guidance in the 
> documentation (e.g. man pages) about how one should choose the cache 
> parameters and how they relate to the available RAM. It's great that nccopy 
> can do this type of rechunking on memory-limited machines (the NCO tools 
> cannot), but as I've found out, setting the parameters correctly is a bit of 
> a challenge! Having some "rules of thumb" written down somewhere would be 
> very useful!

Yes, I agree we need that.  It also helps to have real use cases such
as you provided to help see what's needed and to discover what works.
For a while we've known we need better guidance on chunking and
compression, especially with subtleties of setting the chunk cache
appropriately.  Now that nccopy provides a way to do this without
programming, I think we should be able to finally work out some good
recommendations.  Eventually, I would like for nccopy to figure out
how much chunk cache to allocate and how many chunk cache elements to
specify so users don't have to be bothered with the details ...
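
As a rough sketch of the kind of rule of thumb involved (illustrative
arithmetic only, using a hypothetical helper function, not something nccopy
currently does):

    # Sketch only: estimate chunk-cache settings for rechunking one variable
    # of 4-byte floats, assuming every output chunk must fit in the cache at once.
    import math

    def suggest_cache(dim_lens, chunk_lens, value_size=4):
        nchunks = 1                 # total number of chunks in the variable
        chunk_bytes = value_size    # bytes per chunk, uncompressed
        for name, n in dim_lens.items():
            nchunks *= math.ceil(n / chunk_lens[name])
            chunk_bytes *= chunk_lens[name]
        cache_bytes = nchunks * chunk_bytes   # room for all chunks, uncompressed
        return nchunks, chunk_bytes, cache_bytes

    # Shapes from this thread: -c time/1698,latitude/25,longitude/24
    dims   = {'time': 1698, 'latitude': 1617, 'longitude': 1596}
    chunks = {'time': 1698, 'latitude': 25,   'longitude': 24}
    nchunks, chunk_bytes, cache_bytes = suggest_cache(dims, chunks)
    # "-e" should comfortably exceed nchunks; "-h" should be at least cache_bytes.
    print(nchunks, chunk_bytes, round(cache_bytes / 1e9, 2))   # 4355 4075200 17.75
    # (the "-h 17.53G" quoted later in the thread is the exact variable size,
    #  without the ceiling padding at the grid edges)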

For now, I've entered another comment on this task in our Jira
database, where you can follow its progress if you're interested:

   https://www.unidata.ucar.edu/jira/browse/NCF-73

--Russ

> Thanks for all the help.
> 
> Cheers,
> 
> Mark
> 
> 
> 
> 
> ________________________________________
> From: Unidata netCDF Support [address@hidden]
> Sent: 15 August 2011 22:12
> To: Mark Payne
> Cc: address@hidden; Mark Payne
> Subject: [Support #CUV-251255]: Nccopy extremely slow / hangs
> 
> > Just to give you an example of what it's doing, I logged the file size for
> > the command below as a function of time; see the attached figure. It continues
> > at this low speed until I grow impatient.... :-)
> 
> That doesn't look like O(n**2) behavior from a bad algorithm; it looks like
> thrashing when the process is close to resource exhaustion (in this case, memory).
> I'll be interested to hear whether you can reproduce the examples I tried with
> "nccopy -m 1G -h 2G ...", which should use sufficiently less memory than your
> 8 GB machine has to avoid thrashing, unless your machine is doing lots of other
> processing ...
> 
> --Russ
> 
> >
> > Hi Mark,
> >
> > > After a holiday and a break from this work, I was finally able to have a 
> > > look at it again. Unfortunately, the fix doesn't seem to work for me :-( 
> > > It is still the same problem as previously - the copy starts out fine, 
> > > but gets progressively slower and slower, and ultimately "hangs". Here is 
> > > the command that I am using:
> > >
> > > ./nccopy -u -k3 -d1 -m 2G -h 18G -e 10001 -c 
> > > time/1698,longitude/6,latitude/7 combined_snapshots.nc 
> > > temporal_read_optimised.nc
> > >
> > > I am wondering whether I am setting the -h and -e options correctly? How 
> > > should these be set? I'm not sure I understand the difference between 
> > > them.
> >
> > It looks to me like you are setting -h correctly.  Since you are using smaller
> > sizes for the longitude and latitude dimensions than I used in my tests (6 and 7
> > versus 24 and 25), you will have about 14.3 times as many chunks as I used (I was
> > aiming at each chunk being about 4 MB), so you could set the number of elements
> > in the chunk cache higher (61446 instead of 4301).  I had used 10001 elements in
> > the chunk cache to be generously larger than 4301, but I think the exact value is
> > not too critical as long as it is at least as large as the number of chunks you
> > need in the cache at once.  Since you are compressing the data and reordering it
> > in a way that requires *all* the chunks in memory at once, you need to use at
> > least "-e 61446", and to be generous should probably use something like
> > "-e 61500".  The HDF5 documentation recommends that the number of elements in the
> > chunk cache be prime, but I don't see the necessity for that and haven't noticed
> > any difference whether it's prime or composite.  With your current setting of
> > "-e 10001", chunks that are only partly written will have to be ejected from the
> > cache to make room for new chunks, which leads to lots of unnecessary
> > recompressing of chunks that are ejected before being written to disk, as well as
> > uncompressing partially written chunks when reading them back into the chunk cache.
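
A sketch of the arithmetic behind those element counts, using the
approximation that the number of chunks is total values divided by values per
chunk (which is where 4301 and 61446 come from):

    # Approximate chunk counts for the two chunk shapes discussed above.
    lat, lon = 1617, 1596                     # latitude and longitude sizes

    chunks_25x24 = (lat * lon) / (25 * 24)    # latitude/25,longitude/24 -> ~4301
    chunks_7x6   = (lat * lon) / (7 * 6)      # latitude/7,longitude/6   -> ~61446

    print(round(chunks_25x24), round(chunks_7x6), round(chunks_7x6 / chunks_25x24, 1))
    # 4301 61446 14.3  -- so "-e" must be at least ~61446, e.g. "-e 61500"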
> >
> > You also need to make sure that your computer has enough memory to hold the
> > chunk cache.  You've specified a 2 GB input buffer and 18 GB of chunk cache, so
> > you should have at least 20 GB of memory available for nccopy to run, keeping the
> > data in the chunk cache uncompressed while reordering it.  You might get by with
> > a smaller input buffer, say 11 MB (one time slice of 1617*1596*4 bytes), and a
> > somewhat smaller chunk cache, "-h 17.53G", if you're close to the maximum.
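
The memory figures above work out roughly as follows (a sketch of the
arithmetic, not a measurement):

    # Rough memory budget for the rechunking run described above.
    record_bytes = 1617 * 1596 * 4          # one time slice of floats: ~10.3 MB, so "-m 11M" suffices
    cache_bytes  = 1698 * record_bytes      # all 1698 slices uncompressed: ~17.5 GB, hence "-h 17.53G"
    total_bytes  = record_bytes + cache_bytes   # what nccopy needs resident, beyond normal overhead

    print(round(record_bytes / 1e6, 1), round(cache_bytes / 1e9, 2), round(total_bytes / 1e9, 2))
    # 10.3 17.53 17.54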
> >
> > > The combined_snapshots.nc file is 630MB - a dump of the header is given 
> > > below:
> >
> > My tests have been with simulated data of the same size as yours, but my
> > simulated data may compress better than your actual data.  If you could possibly
> > make your combined_snapshots.nc file available somewhere for me to test nccopy on
> > the real data, I could make sure I can reproduce something like the 15-minute
> > times I'm seeing for the copy and rechunking.  It may be that your use of
> > 1698x7x6 chunks requires more time than the larger 1698x25x24 chunks I was
> > writing, so I could try that as well.
> >
> > > Any ideas?
> >
> > I really can't explain what looks like O(n**2) behavior in writing the output,
> > unless it's something in the HDF5 layer involving a performance bug in the
> > B-trees that index the chunks.  You can't really judge the progress of the copy
> > by the size of the output file, as none of the chunks are complete until the end
> > of the copy.  So the output file should stay fairly small until all of the chunks
> > are flushed to disk (while being compressed) at the end of the rechunking.
> >
> > Also the -h and -e options to nccopy have only been minimally tested, and 
> > there could still be bugs ...
> >
> > --Russ
> >
> > > [mpayne@oleander compiler]$ ncdump combined_snapshots.nc -h -c
> > > netcdf combined_snapshots {
> > > dimensions:
> > > latitude = 1617 ;
> > > longitude = 1596 ;
> > > time = UNLIMITED ; // (1698 currently)
> > > variables:
> > > float chl_oc5(time, latitude, longitude) ;
> > > chl_oc5:_FillValue = 0.f ;
> > > chl_oc5:long_name = "Chlorophyll-a concentration in sea water using the 
> > > OC5 algorithm" ;
> > > chl_oc5:standard_name = 
> > > "mass_concentration_of_chlorophyll_a_in_sea_water" ;
> > > chl_oc5:grid_mapping = "mercator" ;
> > > chl_oc5:units = "milligram m-3" ;
> > > chl_oc5:missing_value = 0.f ;
> > > chl_oc5:units_nonstandard = "mg m^-3" ;
> > > float latitude(latitude) ;
> > > latitude:_FillValue = -999.f ;
> > > latitude:standard_name = "latitude" ;
> > > latitude:long_name = "latitude" ;
> > > latitude:valid_min = -90. ;
> > > latitude:units = "degrees_north" ;
> > > latitude:valid_max = 90. ;
> > > latitude:axis = "Y" ;
> > > float longitude(longitude) ;
> > > longitude:_FillValue = -999.f ;
> > > longitude:standard_name = "longitude" ;
> > > longitude:long_name = "longitude" ;
> > > longitude:valid_min = -180. ;
> > > longitude:units = "degrees_east" ;
> > > longitude:valid_max = 180. ;
> > > longitude:axis = "X" ;
> > > int mercator ;
> > > mercator:false_easting = 0L ;
> > > mercator:standard_parallel = 0L ;
> > > mercator:grid_mapping_name = "mercator" ;
> > > mercator:false_northing = 0L ;
> > > mercator:longitude_of_projection_origin = 0L ;
> > > double time(time) ;
> > > time:_FillValue = -1. ;
> > > time:time_origin = "1970-01-01 00:00:00" ;
> > > time:valid_min = 0. ;
> > > time:long_name = "time" ;
> > > time:standard_name = "time" ;
> > > time:units = "seconds since 1970-01-01 00:00:00" ;
> > > time:calendar = "gregorian" ;
> > > time:axis = "T" ;
> > >
> > > // global attributes:
> > > :site_name = "UK Shelf Seas" ;
> > > :citation = "If you use this data towards any publication, please 
> > > acknowledge this using: \'The authors thank the NERC Earth Observation 
> > > Data Acquisition and Analysis Service (NEODAAS) for supplying data for 
> > > this study\' and then email NEODAAS (address@hidden) with the details. 
> > > The service relies on users\' publications as one measure of success." ;
> > > :creation_date = "Thu Jun 02 10:51:37 2011" ;
> > > :easternmost_longitude = 13. ;
> > > :creator_url = "http://rsg.pml.ac.uk" ;
> > > :references = "See NEODAAS webpages at http://www.neodaas.ac.uk/ or RSG 
> > > pages at http://rsg.pml.ac.uk/" ;
> > > :Metadata_Conventions = "Unidata Dataset Discovery v1.0" ;
> > > :keywords = "satellite,observation,ocean" ;
> > > :summary = "This data is Level-3 satellite observation data (Level 3 
> > > meaning raw observations processedto geophysical quantities, and placed 
> > > onto a regular grid)." ;
> > > :id = 
> > > "M2010001.1235.uk.postproc_products.MYO.01jan101235.v1.20111530951.data.nc"
> > >  ;
> > > :naming_authority = "uk.ac.pml" ;
> > > :geospatial_lat_max = 62.999108 ;
> > > :title = "Level-3 satellite data from Moderate Resolution Imaging 
> > > Spectroradiometer sensor" ;
> > > :source = "Moderate Resolution Imaging Spectroradiometer" ;
> > > :northernmost_latitude = 62.999108 ;
> > > :creator_name = "Plymouth Marine Laboratory Remote Sensing Group" ;
> > > :processing_level = "Level-3 (NASA EOS Conventions)" ;
> > > :creator_email = "address@hidden" ;
> > > :netcdf_library_version = "4.0.1 of Sep  3 2010 11:27:29 $" ;
> > > :date_issued = "Thu Jun 02 10:51:37 2011" ;
> > > :geospatial_lat_min = 47. ;
> > > :date_created = "Thu Jun 02 10:51:37 2011" ;
> > > :institution = "Plymouth Marine Laboratory Remote Sensing Group" ;
> > > :geospatial_lon_max = 13. ;
> > > :geospatial_lon_min = -15. ;
> > > :contact1 = "email: address@hidden" ;
> > > :license = "If you use this data towards any publication, please 
> > > acknowledge this using: \'The authors thank the NERC Earth Observation 
> > > Data Acquisition and Analysis Service (NEODAAS) for supplying data for 
> > > this study\' and then email NEODAAS (address@hidden) with the details. 
> > > The service relies on users\' publications as one measure of success." ;
> > > :Conventions = "CF-1.4" ;
> > > :project = "NEODAAS (NERC Earth Observation Data Acquisition and Analysis 
> > > Service)" ;
> > > :cdm_data_type = "Grid" ;
> > > :RSG_sensor = "MODIS" ;
> > > :westernmost_longitude = -15. ;
> > > :RSG_areacode = "uk" ;
> > > :southernmost_latitude = 47. ;
> > > :netcdf_file_type = "NETCDF4_CLASSIC" ;
> > > :history = "Created during RSG Standard Mapping (Mapper) [SGE Job Number: 
> > > 2577153]" ;
> > > :NCO = "4.0.7" ;
> > > }
> > > [mpayne@oleander compiler]$
> > >
> > >
> >
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >
> >
> >
> > Ticket Details
> > ===================
> > Ticket ID: CUV-251255
> > Department: Support netCDF
> > Priority: Normal
> > Status: Closed
> >
> >
> >
> 
> Russ Rew                                         UCAR Unidata Program
> address@hidden                      http://www.unidata.ucar.edu
> 
> 
> 
> Ticket Details
> ===================
> Ticket ID: CUV-251255
> Department: Support netCDF
> Priority: Normal
> Status: Closed

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: CUV-251255
Department: Support netCDF
Priority: Normal
Status: Closed