reading raw (packed) data from NetCDF files and avoiding missing-value check

Jon Blower jdb at mail.nerc-essc.ac.uk
Mon Oct 30 07:41:49 MST 2006


Hi John,

> are you aware of strided data access, where you can read every 2nd or 10th point etc?

Yes I am, thanks - but because the requested CRS doesn't generally
dovetail nicely with the data's native CRS, a regular stride doesn't
always pick out the points I need.  However, it is certainly useful in
some cases.
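
To illustrate what a strided read gives you, here is a minimal sketch in
plain Java.  It is not taken from my application or from nj22; the class
and method names are hypothetical.  It only shows that a stride samples
the source axis at regular steps in index space, which is a good match
when the image and the data share a CRS and a poor one otherwise.

```java
// Hypothetical helper, for illustration only (not part of nj22).
final class StrideSketch {

    /** A stride that yields at least targetLen samples when sourceLen >= targetLen. */
    static int strideFor(int sourceLen, int targetLen) {
        if (sourceLen <= 0 || targetLen <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        return Math.max(1, sourceLen / targetLen);
    }

    /** Indices visited by a strided read that starts at index 0. */
    static int[] stridedIndices(int sourceLen, int stride) {
        int count = (sourceLen + stride - 1) / stride; // ceiling division
        int[] indices = new int[count];
        for (int i = 0; i < count; i++) {
            indices[i] = i * stride;
        }
        return indices;
    }

    public static void main(String[] args) {
        // 1000 source points reduced to roughly 100 samples
        System.out.println(strideFor(1000, 100));
        System.out.println(java.util.Arrays.toString(stridedIndices(10, 3)));
    }
}
```

The indices are always evenly spaced, so if the output grid is in a
different projection the points actually needed will generally fall
between them.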

> sounds like you are finding the nearest neighbor in the large array, based on what points
> are needed in your output CRS?

Exactly right, yes.
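
As a rough sketch of the lookup (assuming a regularly spaced source
axis, which is not always the case; the class and method names are
hypothetical and this is not my actual code), the nearest index for a
coordinate value can be computed directly:

```java
// Hypothetical helper, for illustration only; assumes a regularly spaced axis.
final class NearestNeighbourSketch {

    /**
     * Index of the axis point nearest to value, or -1 if value lies
     * outside the axis (such pixels would be left blank in the image).
     */
    static int nearestIndex(double axisStart, double axisSpacing, int axisLength, double value) {
        long index = Math.round((value - axisStart) / axisSpacing);
        if (index < 0 || index >= axisLength) {
            return -1;
        }
        return (int) index;
    }

    /** Nearest source index for every coordinate needed by the output image. */
    static int[] nearestIndices(double axisStart, double axisSpacing, int axisLength, double[] values) {
        int[] indices = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            indices[i] = nearestIndex(axisStart, axisSpacing, axisLength, values[i]);
        }
        return indices;
    }

    public static void main(String[] args) {
        // Source axis: 10 points starting at 0.0 with spacing 0.5
        double[] wanted = {1.2, 4.6, -3.0};
        System.out.println(java.util.Arrays.toString(nearestIndices(0.0, 0.5, 10, wanted)));
    }
}
```

The coordinates passed in would first have to be transformed from the
output CRS into the data's native CRS; that step is not shown here.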

> i will also look at trying to make the converter faster.
> do you have any timing test output that would be instructive?

All my timing data is from the NetBeans profiler
(http://profiler.netbeans.org/), which I have found to be extremely
useful for identifying "hot spots" in code.  This produces "live"
graphical output so I'm afraid I don't have anything to send you.  I
could send some screenshots if you like but it would be much more
meaningful for you to run the profiler yourself if this is possible.

Thanks, Jon


On 27/10/06, John Caron <caron at unidata.ucar.edu> wrote:
>
>
> Jon Blower wrote:
> > Hi Don,
> >
> > The problem is caused by my use of the nj22 library.  In my
> > application I need to create an image from a NetCDF file as quickly as
> > possible.  The image will often be of much lower resolution than the
> > source data, but will not necessarily be in the same coordinate
> > reference system.
>
> are you aware of strided data access, where you can read every 2nd or 10th point etc?
>
> sounds like you are finding the nearest neighbor in the large array, based on what points are needed in your output CRS?
>
> >
> > If I want to create a 100x100 image, I need to read at least 10,000
> > data points.  However, reading 10,000 individual points appears to be
> > very slow (especially for an NcML aggregation) so I am compromising by
> > reading chunks of contiguous data at a time.  This means that I often
> > end up reading considerably more data than I need to make the image.
> > I perform the necessary interpolation in my application and throw away
> > the unwanted data.
> >
> > If I read packed data using an "enhanced" variable, then every single
> > point is internally checked to see if it is a missing value, and every
> > single point is unpacked (scale and offset applied).  Through
> > profiling, I established this to be an expensive operation because it
> > is being applied to many more data points than I need.  Therefore I
> > employed a method whereby data are read in their packed form, without
> > being checked for missing values.  I then perform the check just for
> > the 10,000 points that I need to plot in my image.  This is
> > considerably and demonstrably faster, although as with all
> > optimisation problems, it's a compromise.
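
A minimal sketch of the deferred unpacking described above, in plain
Java.  This is not the nj22 EnhanceScaleMissingImpl; the class and
method names are hypothetical, and it assumes short-packed data with a
single missing value and the usual scale_factor/add_offset convention
(unpacked = packed * scale_factor + add_offset).

```java
// Hypothetical helper, for illustration only (not the nj22 EnhanceScaleMissingImpl).
final class LazyUnpackSketch {
    private final double scaleFactor;  // value of the scale_factor attribute
    private final double addOffset;    // value of the add_offset attribute
    private final short missingValue;  // packed value that marks missing data

    LazyUnpackSketch(double scaleFactor, double addOffset, short missingValue) {
        this.scaleFactor = scaleFactor;
        this.addOffset = addOffset;
        this.missingValue = missingValue;
    }

    /** Unpacks one packed value; missing data becomes NaN. */
    double unpack(short packed) {
        if (packed == missingValue) {
            return Double.NaN;
        }
        return packed * scaleFactor + addOffset;
    }

    /** Unpacks only those elements of the raw chunk that the image needs. */
    double[] unpackSelected(short[] packedChunk, int[] wanted) {
        double[] out = new double[wanted.length];
        for (int i = 0; i < wanted.length; i++) {
            out[i] = unpack(packedChunk[wanted[i]]);
        }
        return out;
    }

    public static void main(String[] args) {
        short[] chunk = {100, -32768, 150, 200, 250}; // raw values as read from the file
        LazyUnpackSketch unpacker = new LazyUnpackSketch(0.01, 20.0, (short) -32768);
        double[] values = unpacker.unpackSelected(chunk, new int[] {0, 1, 4});
        System.out.println(java.util.Arrays.toString(values));
    }
}
```

The chunk is read once in packed form, and the missing-value check and
scale/offset arithmetic run only for the indices in `wanted`, so the
cost scales with the number of image pixels rather than the chunk size.
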
> >
> > Does this clear things up?  As far as changes to the libraries go, it
> > would be handy to have a method in GeoGrid for reading "raw" (packed)
> > data as fast as possible, and giving the user the opportunity to
> > unpack the data later.
>
> that seems reasonable, i will see how easy it is to do.
> in any case, some fine-grained control is needed.
>
> i will also look at trying to make the converter faster.
> do you have any timing test output that would be instructive?
>
>
>
> >
> > Best wishes,
> > Jon
> >
> > On 27/10/06, Don Murray <dmurray at unidata.ucar.edu> wrote:
> >
> >> Jon and John-
> >>
> >> Why is it so much slower using the GeoGrid directly?  Perhaps
> >> there can be some performance tuning on the GeoGrid side to
> >> avoid people having to jump through the hoops that Jon is?
> >> Is it because the GeoGrid scales and offsets the entire grid
> >> before subsetting, instead of subsetting and then scaling and
> >> offsetting (which seems to be what Jon ends up doing)?  Jon,
> >> when you say you are scaling and offsetting only the individual
> >> values, is this all the values in the subset or, if not, what
> >> percentage of the subset are you doing this on?
> >>
> >> We've been doing some profiling of the netcdf-java reading
> >> in the IDV and if this is an area where we could get some
> >> performance enhancements, I'd like to implement something
> >> in the IDV.
> >>
> >> Don
> >>
> >> Jon Blower wrote:
> >> > Hi John (cc list),
> >> >
> >> > Thanks for your help - I found a solution that works well in my app.
> >> > As you suggested, I opened the dataset without enhancement, then added
> >> > the coordinate systems:
> >> >
> >> >            nc = NetcdfDataset.openDataset(location, false, null);
> >> >            // Add the coordinate systems
> >> >            CoordSysBuilder.addCoordinateSystems(nc, null);
> >> >            GridDataset gd = new GridDataset(nc);
> >> >            GeoGrid geogrid = gd.findGridByName(varID);
> >> >
> >> > I then create an EnhanceScaleMissingImpl:
> >> >
> >> >            EnhanceScaleMissingImpl enhanced =
> >> >                new EnhanceScaleMissingImpl((VariableDS) geogrid.getVariable());
> >> >
> >> > (Unfortunately this class is package-private so I made a copy from the
> >> > source code in my local directory.  Could this class be made public
> >> > please?)
> >> >
> >> > This means that when I read data using geogrid.subset() it does not
> >> > check for missing values or unpack the data and is therefore quicker.
> >> > I then do enhanced.convertScaleOffsetMissing() only on the individual
> >> > values I need to work with.  Seems to work well and is pretty quick.
> >> > Is there anything dangerous in the above?
> >> >
> >> > Thanks again,
> >> > Jon
> >> >
> >> >
> >> > On 26/10/06, John Caron <caron at unidata.ucar.edu> wrote:
> >> >> Hi Jon:
> >> >>
> >> >> Jon Blower wrote:
> >> >> > Hi John,
> >> >> >
> >> >> > I need some of the functionality of a GridDataset to allow me to read
> >> >> > coordinate system information.  Also, I might be opening an NcML
> >> >> > aggregation.  Is it sensible to use NetcdfDataset.getReferencedFile()?
> >> >> > In the case of an NcML aggregation, is it possible to get a handle to
> >> >> > a specific NetcdfFile (given relevant information such as the
> >> >> > timestep)?
> >> >>
> >> >> You are getting into the internals, so it's a bit dangerous.
> >> >>
> >> >> I think this will work:
> >> >>
> >> >>  NetcdfDataset ncd = NetcdfDataset.openDataset(location, false, null); // don't enhance
> >> >>  ucar.nc2.dataset.CoordSysBuilder.addCoordinateSystems(ncd, null); // add coord info
> >> >>  GridDataset gds = new GridDataset(ncd); // make into a grid
> >> >>
> >> >> BTW, you will want to switch to the new GridDataset in
> >> >> ucar.nc2.dt.grid when you start using 2.2.17. It should be compatible,
> >> >> let me know.
> >> >>
> >> >>
> >> >> >
> >> >> > On a related note, is it efficient to read data from a NetcdfFile (or
> >> >> > NetcdfDataset) point-by-point?  I have been assuming that reading
> >> >> > contiguous chunks of data is more efficient than reading individual
> >> >> > points, even if it means reading more data than I actually need, but
> >> >> > perhaps this is not the case?  Unfortunately I'm not at my usual
> >> >> > computer so I can't do a quick check myself.  If reading data
> >> >> > point-by-point is efficient (enough) my problem goes away.
> >> >>
> >> >> It depends on data locality. If the points are close together on disk,
> >> >> then they are likely to be in the random access file buffer already.
> >> >> The bigger the buffer, the more likely this is; you can try different
> >> >> buffer sizes with:
> >> >>
> >> >> NetcdfDataset openDataset(String location, boolean enhance, int
> >> >> buffer_size, ucar.nc2.util.CancelTask cancelTask, Object spiObject);
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> > Thanks, Jon
> >> >> >
> >> >> > On 26/10/06, John Caron <caron at unidata.ucar.edu> wrote:
> >> >> >
> >> >> >> Hi Jon:
> >> >> >>
> >> >> >> One obvious thing would be to open it as a NetcdfFile, not a
> >> >> >> GridDataset. Is that a possibility?
> >> >> >>
> >> >> >> Jon Blower wrote:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > I'm writing an application that reads data from NetCDF files and
> >> >> >> > produces images.  I've noticed (through profiling) that a slow point
> >> >> >> > in the data reading process is the unpacking of packed data (i.e.
> >> >> >> > applying scale and offset) and checking for missing values.  I would
> >> >> >> > like to minimize the use of these calls.
> >> >> >> >
> >> >> >> > To cut a long post short, I would like to find a low-level function
> >> >> >> > that allows me to read the packed data, exactly as they appear in the
> >> >> >> > file.  I can then "manually" apply the unpacking and missing-value
> >> >> >> > checks only to those data points that I need to display.
> >> >> >> >
> >> >> >> > I'm using nj22, version 2.2.16.  I've tried reading data from
> >> >> >> > GeoGrid.subset() but this (of course) performs the unpacking.  I then
> >> >> >> > tried getting the "unenhanced" variable object through
> >> >> >> > GeoGrid.getVariable().getOriginalVariable(), but (unexpectedly) this
> >> >> >> > also seems to perform unpacking and missing-value checks - I expected
> >> >> >> > it to give raw data.
> >> >> >> >
> >> >> >> > Can anyone help me to find a function for reading raw (packed) data
> >> >> >> > without performing missing-value checks?
> >> >> >> >
> >> >> >> > Thanks in advance,
> >> >> >> > Jon
> >> >> >> >
> >> --
> >> *************************************************************
> >> Don Murray                               UCAR Unidata Program
> >> dmurray at unidata.ucar.edu                        P.O. Box 3000
> >> (303) 497-8628                              Boulder, CO 80307
> >> http://www.unidata.ucar.edu/staff/donm
> >> *************************************************************
> >>
> >>
> >>
> >
> >
>


-- 
--------------------------------------------------------------
Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
Technical Director         Tel: +44 118 378 8741 (ESSC)
Reading e-Science Centre   Fax: +44 118 378 6413
ESSC                       Email: jdb at mail.nerc-essc.ac.uk
University of Reading
3 Earley Gate
Reading RG6 6AL, UK
--------------------------------------------------------------

==============================================================================
To unsubscribe netcdf-java, visit:
http://www.unidata.ucar.edu/mailing-list-delete-form.html
==============================================================================


