Re: [netcdf-java] newbie question on NetCDF file overhead

Hi Jeff,

Looks like you are writing netCDF-4 files. Given that, I believe the
problem is the default chunking that is used when an unlimited dimension
is involved. From an old email conversation on the netcdfgroup email
list [1], which references a netCDF-C chunking document [2], it sounds
like each record along the unlimited dimension becomes a single chunk.
Since each of your records is so small, that adds a lot of per-chunk
overhead and is likely the culprit in the larger-than-expected files
you are seeing.
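A quick back-of-the-envelope calculation shows how one-record chunks could plausibly account for the numbers you report. The ~50 bytes of fixed cost per chunk below is purely an assumed illustrative figure (the real per-chunk cost in the HDF5 layer varies), but it puts the result in the right ballpark:

```java
// Rough estimate of why one-record chunks inflate a netCDF-4 file.
public class ChunkOverheadEstimate {

    // Raw data size: number of records times bytes per record.
    static long rawBytes(int numRecords, int bytesPerRecord) {
        return (long) numRecords * bytesPerRecord;
    }

    // Assumed fixed bookkeeping cost per chunk (B-tree entry, chunk
    // metadata). The 50-byte figure used below is an assumption, not
    // a measured HDF5 constant.
    static long estimatedOverhead(int numVariables, int chunksPerVariable,
                                  int assumedPerChunkBytes) {
        return (long) numVariables * chunksPerVariable * assumedPerChunkBytes;
    }

    public static void main(String[] args) {
        int numRecords = 10000;
        int bytesPerRecord = 24 + 3 * 2;  // 24-char timestamp + three shorts

        long raw = rawBytes(numRecords, bytesPerRecord);       // 300000 bytes

        // With one record per chunk, each of the 4 variables gets
        // 10000 chunks.
        long overhead = estimatedOverhead(4, numRecords, 50);  // 2000000 bytes

        System.out.println("raw data:            " + raw + " bytes");
        System.out.println("est. chunk overhead: " + overhead + " bytes");
        System.out.println("est. total:          " + (raw + overhead) + " bytes");
    }
}
```

Under that assumption the estimated total comes out around 2.3 MB, in the same ballpark as the 2420656-byte file your test program produced.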

In your code below, try the overload of NetcdfFileWriter.createNew()
that takes an Nc4Chunking parameter. The easiest approach is probably
to use the "standard" strategy

> Nc4ChunkingStrategyImpl.factory(Nc4Chunking.Strategy.standard, 0, false)

and to add a "_ChunkSize" attribute to each variable with a single
integer value, since you only have the one dimension. I'm not a
chunking expert, but a value of 2000 might be a reasonable starting
point. I've included links to a few blog posts on chunking and
compression ([3], [4], and [5]) that discuss how to choose chunk sizes.
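To make that concrete, the changed lines in your program might look roughly like the fragment below. This is an untested sketch, not a complete compilable program: it assumes the netCDF-Java 4.x createNew() overload and Nc4ChunkingStrategyImpl.factory() signature, and the chunk length of 2000 is just the starting guess above, not a tuned value.

```java
import ucar.nc2.Attribute;
import ucar.nc2.NetcdfFileWriter;
import ucar.nc2.write.Nc4Chunking;
import ucar.nc2.write.Nc4ChunkingStrategyImpl;

// "standard" chunking strategy, deflate level 0 (no compression), no shuffle
Nc4Chunking chunker =
    Nc4ChunkingStrategyImpl.factory(Nc4Chunking.Strategy.standard, 0, false);

// pass the chunker instead of calling the two-argument createNew()
NetcdfFileWriter dataFile =
    NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf4,
        filePathName, chunker);

// ... add dimensions, variables, and attributes as before, then, while
// still in define mode, suggest 2000 records per chunk per variable:
dataFile.addVariableAttribute(time, new Attribute("_ChunkSize", 2000));
dataFile.addVariableAttribute(bx, new Attribute("_ChunkSize", 2000));
dataFile.addVariableAttribute(by, new Attribute("_ChunkSize", 2000));
dataFile.addVariableAttribute(bz, new Attribute("_ChunkSize", 2000));
```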

Hope that helps.

Ethan


[1]
https://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2012/msg00005.html

[2]
http://www.unidata.ucar.edu/software/netcdf/docs/netcdf-c/nc_005fdef_005fvar_005fchunking.html#nc_005fdef_005fvar_005fchunking

[3]
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters

[4]
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes

[5] http://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf_compression

On 4/30/2014 5:09 PM, Jeff Johnson - NOAA Affiliate wrote:
> Sorry, correction - raw data = 300000 bytes, so NetCDF is 8x larger.
> 
> jeff
> 
> 
> On Wed, Apr 30, 2014 at 4:37 PM, Jeff Johnson - NOAA Affiliate
> <jeff.m.johnson@xxxxxxxx> wrote:
> 
>     Hi all-
> 
>     I'm working on generating my first NetCDF files and have a question.
>     The files I'm creating seem to be far larger than I would have
>     thought necessary to hold the given data. I'm wondering if there is
>     something I can do to trim this down a bit.
> 
>     Our data is simple time-series data (one unlimited dimension). Below
>     is a simple Java test program that generates a file with 10000
>     records, each of which contains a 24-character timestamp string and
>     three 2-byte values. This gives a raw data requirement of 30000
>     bytes. The generated NetCDF file is 2420656 bytes, or 80x larger. Is
>     this what is expected?  In my development with real data I'm seeing
>     7MB of data creating an 86MB NetCDF file, etc. It seems to settle
>     out at about 12x as the data sets grow, which is still pretty
>     onerous. Any insights or suggestions appreciated.
> 
>     package gov.noaa.swpc.solarwind;
> 
>     import org.joda.time.DateTime;
>     import ucar.ma2.ArrayShort;
>     import ucar.ma2.ArrayString;
>     import ucar.ma2.DataType;
>     import ucar.ma2.InvalidRangeException;
>     import ucar.nc2.*;
> 
>     import java.io.IOException;
>     import java.nio.file.FileSystems;
>     import java.nio.file.Files;
>     import java.nio.file.Path;
>     import java.util.ArrayList;
>     import java.util.List;
> 
>     public class TestGenFile {
>       public static void main(String[] args) {
>         DateTime startDate = new DateTime();
>         DateTime endDate = startDate.plusDays(1);
> 
>         NetcdfFileWriter dataFile = null;
> 
>         try {
>           try {
> 
>             // define the file
>             String filePathName = "output.nc";
> 
>             // delete the file if it already exists
>             Path path = FileSystems.getDefault().getPath(filePathName);
>             Files.deleteIfExists(path);
> 
>             // enter definition mode for this NetCDF-4 file
>             dataFile =
>     NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf4,
>     filePathName);
> 
>             // create the root group
>             Group rootGroup = dataFile.addGroup(null, null);
> 
>             // define the global attributes
>             dataFile.addGroupAttribute(rootGroup, new
>     Attribute("startDate", startDate.toString()));
>             dataFile.addGroupAttribute(rootGroup, new
>     Attribute("endDate", endDate.toString()));
> 
>             // define dimensions, in this case only one: time
>             Dimension timeDim = dataFile.addUnlimitedDimension("time");
>             List<Dimension> dimList = new ArrayList<>();
>             dimList.add(timeDim);
> 
>             // define variables
>             Variable time = dataFile.addVariable(rootGroup, "time",
>     DataType.STRING, dimList);
>             dataFile.addVariableAttribute(time, new
>     Attribute("standard_name", "time"));
> 
>             Variable bx = dataFile.addVariable(rootGroup, "bx",
>     DataType.SHORT, dimList);
>             dataFile.addVariableAttribute(bx, new Attribute("long_name",
>     "IMF Bx"));
>             dataFile.addVariableAttribute(bx, new Attribute("units",
>     "raw counts"));
> 
>             Variable by = dataFile.addVariable(rootGroup, "by",
>     DataType.SHORT, dimList);
>             dataFile.addVariableAttribute(by, new Attribute("long_name",
>     "IMF By"));
>             dataFile.addVariableAttribute(by, new Attribute("units",
>     "raw counts"));
> 
>             Variable bz = dataFile.addVariable(rootGroup, "bz",
>     DataType.SHORT, dimList);
>             dataFile.addVariableAttribute(bz, new Attribute("long_name",
>     "IMF Bz"));
>             dataFile.addVariableAttribute(bz, new Attribute("units",
>     "raw counts"));
> 
>             // create the file
>             dataFile.create();
> 
>             // create 1-D arrays to hold data values (time is the dimension)
>             ArrayString timeArray = new ArrayString.D1(1);
>             ArrayShort.D1 bxArray = new ArrayShort.D1(1);
>             ArrayShort.D1 byArray = new ArrayShort.D1(1);
>             ArrayShort.D1 bzArray = new ArrayShort.D1(1);
> 
>             int[] origin = new int[]{0};
> 
>             // write the records to the file
>             for (int i = 0; i < 10000; i++) {
>               // load data into array variables
>               timeArray.setObject(timeArray.getIndex(), new
>     DateTime().toString());
>               bxArray.set(0, (short) i);
>               byArray.set(0, (short) (i * 2));
>               bzArray.set(0, (short) (i * 3));
> 
>               origin[0] = i;
> 
>               // write a record
>               dataFile.write(time, origin, timeArray);
>               dataFile.write(bx, origin, bxArray);
>               dataFile.write(by, origin, byArray);
>               dataFile.write(bz, origin, bzArray);
>             }
>           } finally {
>             if (null != dataFile) {
>               // close the file
>               dataFile.close();
>             }
>           }
>         } catch (IOException | InvalidRangeException e) {
>           e.printStackTrace();
>         }
>       }
>     }
> 
>     thanks,
>     jeff
> 
>     -- 
>     Jeff Johnson
>     DSCOVR Ground System Development
>     Space Weather Prediction Center
>     jeff.m.johnson@xxxxxxxx
> 
> 
> 
> 
> -- 
> Jeff Johnson
> DSCOVR Ground System Development
> Space Weather Prediction Center
> jeff.m.johnson@xxxxxxxx
> 303-497-6260
> 
> 
> _______________________________________________
> netcdf-java mailing list
> netcdf-java@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe, visit: 
> http://www.unidata.ucar.edu/mailing_lists/ 
> 

-- 
Ethan Davis                                       UCAR Unidata Program
edavis@xxxxxxxxxxxxxxxx                    http://www.unidata.ucar.edu


