
[netCDF #NJK-505013]: netCDF size blowup for variable-length variables



Hello,

I am not certain why this is happening, although I have a few guesses. You 
mention the CDL is 229 MB; does it compress to a manageable size that can be 
sent to us?  Is it possible to subset the CDL in a way that will let you email 
it to me, so that I can investigate this further?

I cannot yet rule out a bug. How are you converting this CDL to binary 
netCDF-4? Would it be possible to turn off fill values and see whether that 
changes the file size?

As I sit here trying to hypothesize what is happening, I think I would need to 
experiment with a sample data set before I could assert anything concrete.  My 
immediate thought is that, since the data is stored in gridded fashion under 
the hood, the resulting sparse grid is being populated with large fill values.  
If we were able to get a representative sample binary file (or a CDL file from 
which I could generate such a sample), I would be able to determine more.
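
For a sense of scale, the figures quoted in the original message can be turned 
into per-feature numbers. The following is only a back-of-the-envelope sketch: 
it assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which the 
message does not specify, and the variable names are illustrative. Under those 
assumptions the binary file uses roughly 11.5 kB per road feature, against 
under 200 bytes of CDL text per feature and 68 bytes of fixed-size data per 
feature, so nearly all of the space is associated with the variable-length 
geometry or with storage overhead.

```python
# Figures quoted in the original message (units assumed to be decimal).
N_FEATURES = 1_214_227        # current length of the road_features dimension
CDL_BYTES = 229 * 10**6       # "about 229 MB" of CDL text
NC_BYTES = 14 * 10**9         # "14 GB" binary netCDF file

# Fixed-size data per feature, read off the CDL header:
#   road_static_data: 7 doubles, road_feature_id: one int64,
#   emission variable: one float for the single time step.
# This excludes the variable-length geometry and any chunking overhead.
FIXED_BYTES_PER_FEATURE = 7 * 8 + 8 + 4

nc_per_feature = NC_BYTES / N_FEATURES
cdl_per_feature = CDL_BYTES / N_FEATURES
blowup = NC_BYTES / CDL_BYTES

print(f"binary bytes per feature:   {nc_per_feature:,.0f}")
print(f"CDL text bytes per feature: {cdl_per_feature:,.0f}")
print(f"fixed-size bytes/feature:   {FIXED_BYTES_PER_FEATURE}")
print(f"binary / CDL size ratio:    {blowup:.1f}")
```

Repeating the same calculation on a small subset of records would show 
whether the per-feature cost is constant or grows with the number of features.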

If you were to subset the data file for even a handful of records, I would be 
most grateful; thank you in advance!

-Ward



> Dear sir/madam,
> 
> my name is Christian Asker and I work at the research department of the 
> Swedish Meteorological and Hydrological Institute (SMHI).
> 
> We use netCDF as main storage format in our Air-Quality-Management system 
> CLAIR. Usually it works very well, but we have recently found that when 
> storing variable-length variables, the size of the resulting netCDF file is 
> unreasonably large.
> 
> Since the file is so large, I cannot send it easily, but here is the header 
> part (from ncdump):
> 
> 
> netcdf emission_PM10_mean_brake_dir_line_road_hours {
> types:
>   float(*) line_vlen_type ;
> dimensions:
>   time = UNLIMITED ; // (1 currently)
>   bounds_dim = 2 ;
>   road_features = UNLIMITED ; // (1214227 currently)
>   road_static_params = UNLIMITED ; // (7 currently)
>   static_param_name_len = 30 ;
> variables:
>   int EPSG_3006 ;
>     EPSG_3006:srid = 3006LL ;
>     EPSG_3006:crs_wkt = "PROJCS[\"SWEREF99 TM\",GEOGCS[\"SWEREF99\",DATUM[\"SWEREF99\",SPHEROID[\"GRS 1980\",6378137,298.257222101,AUTHORITY[\"EPSG\",\"7019\"]],AUTHORITY[\"EPSG\",\"6619\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4619\"]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",15],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Northing\",NORTH],AXIS[\"Easting\",EAST],AUTHORITY[\"EPSG\",\"3006\"]]" ;
>   double time(time) ;
>     time:calendar = "gregorian" ;
>     time:units = "hours since 1970-01-01 00:00:00" ;
>     time:bounds = "time_bounds" ;
>     time:long_name = "time" ;
>     time:axis = "T" ;
>   double time_bounds(time, bounds_dim) ;
>   line_vlen_type road_geometry(road_features) ;
>   int64 road_feature_id(road_features) ;
>   float emission_PM10_mean_brake_dir_line_road_hours(time, road_features) ;
>     emission_PM10_mean_brake_dir_line_road_hours:long_name = "Emission PM10, mean, brake_dir, line, road, hours" ;
>     emission_PM10_mean_brake_dir_line_road_hours:feature_ids = "road_feature_id" ;
>     emission_PM10_mean_brake_dir_line_road_hours:parameter = "emission_PM10" ;
>     emission_PM10_mean_brake_dir_line_road_hours:_FillValue = NaNf ;
>     emission_PM10_mean_brake_dir_line_road_hours:cell_methods = "mean" ;
>     emission_PM10_mean_brake_dir_line_road_hours:substance = "PM10" ;
>     emission_PM10_mean_brake_dir_line_road_hours:units = "kg/s" ;
>     emission_PM10_mean_brake_dir_line_road_hours:static_data = "road_static_data" ;
>     emission_PM10_mean_brake_dir_line_road_hours:grid_mapping = "EPSG_3006" ;
>     emission_PM10_mean_brake_dir_line_road_hours:geometry = "road_geometry" ;
>     emission_PM10_mean_brake_dir_line_road_hours:instance = "brake_dir" ;
>     emission_PM10_mean_brake_dir_line_road_hours:feature_type = "road" ;
>     emission_PM10_mean_brake_dir_line_road_hours:quantity = "emission" ;
>     emission_PM10_mean_brake_dir_line_road_hours:variable_type = "line" ;
>   double road_static_data(road_features, road_static_params) ;
>     road_static_data:_FillValue = NaN ;
>   char road_static_params(road_static_params, static_param_name_len) ;
>     road_static_params:_FillValue = "n" ;
> 
> // global attributes:
>     :Created_using = "0.12.0.dev0" ;
>     :history = "20230731 15:04 Created dataset" ;
> }
> 
> So, we are storing 1214227 objects (roads in this case) for a single 
> time-step and for each road feature we store the geometry, which is of 
> variable length (arrays of coordinate points).
> When dumping the netCDF to text (CDL), the size is about 229 MB; however, the 
> netCDF file is 14 GB! We cannot understand why this happens, since we would 
> expect the netCDF file to be smaller than the corresponding CDL file.
> 
> We have tried with a few different versions of netCDF (4.4 and 4.8) but the 
> issue remains.
> 
> We mainly use netCDF through the Python netCDF4 module and/or Xarray. In 
> most cases, where we have time series of gridded data, it works very well; 
> it is only when storing variable-length, vector-type data that we have 
> problems.
> 
> Is there anything we can do when creating the netCDF to avoid this "blowup" 
> in size? Is it a bug or is it expected behaviour?
> 
> 
> If needed, we can make the full netCDF-file in the example above available 
> for you to download.
> 
> Best regards,
> Christian Asker
> 
> 
> 
> --
> 
> Christian Asker
> 
> Air Quality Researcher
> Ph. D. Physics
> Phone: +46 11 495 8645
> E-mail: address@hidden
> 
> Research Department
> Swedish Meteorological and Hydrological Institute
> SE-601 76 Norrköping
> 


Ticket Details
===================
Ticket ID: NJK-505013
Department: Support netCDF
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.