
[netCDF #UGC-492931]: I/O Overhead when reading small subsets from large Tiled Files



Hi Julian,

> I am currently trying to optimize a global model.
> The model reads small chunks (501x501) from lots (one for each day) of
> global datasets (40000x20000) (netCDF 4.2).
> These datasets are compressed netCDF files with tiling activated (100x100).
> (See output of gdalinfo attached.)

The gdalinfo output doesn't show what format is used for the 
netCDF file, but from what you said, I'm assuming it's netCDF-4
or netCDF-4 classic model format.  If you run "ncdump -s -h" on
the file, it would show the chunk shapes (which I'm assuming 
are 100 by 100) as well as the file format.  From the gdalinfo
output, I also assume the data is 32-bit floating-point, but
it doesn't show which kind of compression is used. If it's zlib
compression, it would be useful to know what deflation level
was used, which would also be in the ncdump -s -h output.

> However when I measure the file I/O via NFS I get a factor of ~10
> compared to the uncompressed output image when testing with
> gdal_translate (which uses netCDF internally).

By a factor of 10, do you mean 10 times slower?

Measuring I/O times can be tricky. Are you clearing the disk
cache between timing runs? Are you measuring wall-clock time
or actual I/O time? The time to read and uncompress 36 100x100 
chunks of data for each 501x501 block may be dominated by CPU
time to uncompress the chunks, rather than I/O time. If you
are measuring wall-clock time, reading compressed data might
be a factor of 10 times slower than reading uncompressed data.
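The chunk arithmetic above can be sketched quickly. This is a back-of-the-envelope estimate in Python (assuming 0-based offsets, 100x100 chunks as in the gdalinfo output, 4-byte floats, and the window offset 6000 from the gdal_translate example):

```python
def chunks_touched(start, count, chunk):
    """Chunks along one dimension intersected by the range [start, start+count)."""
    return (start + count - 1) // chunk - start // chunk + 1

chunk = 100          # 100x100 chunks, per the gdalinfo block size
count = 501          # 501x501 read window
n = chunks_touched(6000, count, chunk)   # 6 chunks along each dimension
total = n * n                            # 36 chunks per 501x501 read

# Raw bytes that must be read and uncompressed vs. bytes actually wanted,
# before any compression savings (float32 = 4 bytes):
read_bytes = total * chunk * chunk * 4
want_bytes = count * count * 4
print(total, round(read_bytes / want_bytes, 2))   # prints: 36 1.43
```

So even ignoring compression, each 501x501 read touches about 1.4x the requested data; the rest of any slowdown would come from decompression CPU time and per-chunk access costs.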

> Inside the Fortran model, using the netCDF library directly (code see
> below), I measure an even worse factor of ~60 compared to compressed
> outputs. This is better than using untiled inputs (measured with
> gdal_translate), where the overhead was ~80x, but still a larger overhead
> than I expected.
> I tested it using: gdal_translate in.nc out.tif -srcwin 6000 6000 500 500

I'm not sure what you're measuring. Are you saying reading chunked
but uncompressed data is 60 times slower than reading chunked and
compressed data? But that reading unchunked (and compressed?) data
was 80 times slower than reading uncompressed data? Sorry, but it's
not clear to me exactly what you are comparing for these timings.

> Is there anything that can be done to decrease this overhead?

We'll need more information to answer that question.  Would it be
possible to make available a sample file and a shell script that
demonstrates the timing differences? That way we could make sure
there is no caching going on and see exactly what you're 
measuring.
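To illustrate why separating wall-clock time from CPU time matters here, below is a small standalone timing sketch in Python. It uses zlib directly in place of the netCDF library's deflate filter (an assumption for illustration), with a buffer sized like one 100x100 float32 chunk:

```python
import time
import zlib

# One 100x100 chunk of 4-byte values, compressed as a zlib deflate filter would
chunk = zlib.compress(b"\x00" * (100 * 100 * 4), 2)

wall0 = time.perf_counter()
cpu0 = time.process_time()
for _ in range(1000):              # decompress roughly the chunks many reads touch
    raw = zlib.decompress(chunk)
wall = time.perf_counter() - wall0
cpu = time.process_time() - cpu0

assert len(raw) == 100 * 100 * 4
# If cpu is close to wall, the run is CPU-bound (decompression), not I/O-bound;
# a large gap between them points at I/O or network (e.g. NFS) waits instead.
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

Recording both numbers for the compressed and uncompressed cases would show whether the factor of 10 is decompression cost or actual I/O.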

> For a global model run, the overhead would add up to about 300TB*

Do you mean that the difference between making the data available in
compressed rather than uncompressed form would be 300TB? It would be
more useful to know the relative size (e.g. 200TB instead of 500TB)
rather than the difference.
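For scale, the raw uncompressed size implied by the gdalinfo output can be computed directly (a back-of-the-envelope sketch, assuming 32-bit floats), which is why a relative figure would say more than the 300TB difference alone:

```python
# Size of one uncompressed global field, from the gdalinfo "Size is 40320, 20160"
nx, ny = 40320, 20160
bytes_per_value = 4                 # Float32 per the gdalinfo band type
one_field = nx * ny * bytes_per_value
print(one_field / 1e9)              # ~3.25 GB per daily field
print(365 * one_field / 1e12)       # ~1.19 TB uncompressed for a year of days
```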

> *I am thankful for any suggestions on how to reduce this overhead.

I hope we can help, given more specific information about the
problem ...

--Russ

> Reading the file(s) from inside the model via:
> 
>     file_name = 'myfile.nc'
>     start(1) = 6000
>     start(2) = 6000
>     count(1) = 501
>     count(2) = 501
>     status = nf90_open(file_name, NF90_NOWRITE, ncid)
>     status = nf90_inq_varid(ncid, 'LAI', varid)
>     status = nf90_get_var(ncid, varid, lai_in, start(1:2), count(1:2))
>     status = nf90_close(ncid)
> 
> 
> 
> 
> 
> 
> 
> Gdalinfo:
> 
> Driver: netCDF/Network Common Data Format
> Files: LAI_Y2014_C10.nc
> Size is 40320, 20160
> Coordinate System is `'
> Origin = (-179.999993460545994,90.000001090156800)
> Pixel Size = (0.008928571482636,-0.008928571915164)
> Metadata:
>   LAI#Fill_Value=-999
>   LAI#long_name=Leaf Area Index
>   LAI#units=m2/m2
>   LAI#valid_range={1,10}
>   latitude#long_name=Latitude
>   latitude#units=degrees_north
>   longitude#long_name=Longitude
>   longitude#units=degrees_east
>   NC_GLOBAL#comment=email:address@hidden
>   NC_GLOBAL#Conventions=CF-1.4
>   NC_GLOBAL#institution=DLR-DFD
>   NC_GLOBAL#title=10 Day LAI Composite based on Geoland2
> Corner Coordinates:
>   Upper Left  (-179.9999935,  90.0000011)
>   Lower Left  (-179.9999935, -90.0000087)
>   Upper Right ( 180.0000087,  90.0000011)
>   Lower Right ( 180.0000087, -90.0000087)
>   Center      (   0.0000076,  -0.0000038)
> Band 1 Block=100x100 Type=Float32, ColorInterp=Undefined
>   NoData Value=-999
>   Metadata:
>     Fill_Value=-999
>     long_name=Leaf Area Index
>     NETCDF_VARNAME=LAI
>     units=m2/m2
>     valid_range={1,10}
> 
> 
> ------------------------------------------------------------------------
> 
> Deutsches Zentrum für Luft- und Raumfahrt (DLR)
> German Aerospace Center
> Earth Observation Center | German Remote Sensing Data Center | Land
> Surface | Oberpfaffenhofen | 82234 Wessling | Germany
> 
> Julian Zeidler
> Telephone +49 8153 28-1229 | Telefax +49 8153 28-1445
> | address@hidden <mailto:address@hidden>
> www.DLR.de/eoc <http://www.dlr.de/eoc>

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: UGC-492931
Department: Support netCDF
Priority: Normal
Status: Closed