
Re: 20050302: netCDF General - Large file problem



>To: address@hidden
>From: "Jim Cowie" <address@hidden>
>Subject: netCDF General - Large file problem
>Organization: RAL
>Keywords: 200503022003.j22K3ZjW025634

Jim,

You wrote:

> I will include a CDL of the file below. As far as I can tell, the
> structure of these variables should allow a large file version,
> according to the restrictions on large files under the
> 3.5 documentation.

It looks like the structure of your CDL violates one of the
restrictions for large files under the "classic" format, described in
the 3.5 documentation.  The netCDF library should have returned an
error when you tried to define "cprob_snow", the first variable to
violate the format constraints.  The documentation says:

  There are important constraints on the structure of large netCDF
  files that result from the 32-bit relative offsets that are part of
  the netCDF file format:

  If you don't use the unlimited dimension, only one variable can
  exceed 2 Gbytes in size, but it can be as large as the underlying
  file system permits. It must be the last variable in the dataset,
  and the offset to the beginning of this variable must be less than
  about 2 Gbytes. For example, the structure of the data might be
  something like:

   netcdf bigfile1 {
      dimensions: 
         x=2000;
         y=5000;
         z=10000;
      variables:
         double x(x);         // coordinate variables
         double y(y);
         double z(z);
         double var(x, y, z); // 800 Gbytes
      }

From your CDL, the offset to the beginning of the cprob_snow variable
is at least the size of the header plus the size of the data arrays
for all the previous variables.  Ignoring the size of the header, I
just added up the sizes of the preceding data arrays with the help of
a little python script.  In units of bytes, these work out to

          0     type_bytes
          4     forc_time
         12     creation_time
         20     num_sites
         24     site_list
       9224     T
  211977224     max_T
  264969224     min_T
  317961224     dewpt
  529929224     wind_u
  741897224     wind_v
  953865224     wind_speed
 1165833224     cloud_cov
 1377801224     visibility
 1589769224     prob_fog
 1801737224     prob_thunder
 2013705224     cprob_rain
 2225673224     cprob_snow
 2437641224     cprob_ice
 2649609224     prob_precip06
 2861577224     prob_precip24
 2914569224     qpf06

so the last five variables all have offsets larger than 2**31 =
2147483648.  The netCDF 3.5.0 library should have returned an error
when you tried to define cprob_snow with an nf_def_var() call, and if
it didn't, that's a bug.  I verified that the 3.6.0 library does
return an error in this case, as it should.
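
For reference, the running totals above can be reproduced with a sketch
along the following lines.  This is not the original script, only a
reconstruction from the CDL at the end of this message: the names
TYPE_SIZE, DIMS, VARIABLES and data_offsets are made up for
illustration, the header size is assumed to be zero as in the table,
and the first variable is listed under its CDL name "type".

```python
from math import prod

# Byte sizes of the external types that appear in the CDL.
TYPE_SIZE = {"int": 4, "float": 4, "double": 8}

# Dimension lengths copied from the CDL.
DIMS = {
    "max_site_num": 2300, "num_eqns": 30, "var_regressors": 3,
    "days": 16, "fc_times_per_day": 4, "daily_time": 1, "weight_vals": 4,
}

# The two shapes shared by the large variables.
SUBDAILY = ("max_site_num", "days", "fc_times_per_day",
            "num_eqns", "var_regressors", "weight_vals")
DAILY = ("max_site_num", "days", "daily_time",
         "num_eqns", "var_regressors", "weight_vals")

# (name, type, shape) in definition order, as in the CDL.
VARIABLES = [
    ("type", "int", ()),
    ("forc_time", "double", ()),
    ("creation_time", "double", ()),
    ("num_sites", "int", ()),
    ("site_list", "int", ("max_site_num",)),
    ("T", "float", SUBDAILY),
    ("max_T", "float", DAILY),
    ("min_T", "float", DAILY),
    ("dewpt", "float", SUBDAILY),
    ("wind_u", "float", SUBDAILY),
    ("wind_v", "float", SUBDAILY),
    ("wind_speed", "float", SUBDAILY),
    ("cloud_cov", "float", SUBDAILY),
    ("visibility", "float", SUBDAILY),
    ("prob_fog", "float", SUBDAILY),
    ("prob_thunder", "float", SUBDAILY),
    ("cprob_rain", "float", SUBDAILY),
    ("cprob_snow", "float", SUBDAILY),
    ("cprob_ice", "float", SUBDAILY),
    ("prob_precip06", "float", SUBDAILY),
    ("prob_precip24", "float", DAILY),
    ("qpf06", "float", SUBDAILY),
]

LIMIT = 2**31  # offsets in the classic format are 32-bit


def data_offsets(variables, dims, header_size=0):
    """Return (name, offset) pairs for fixed-size variables stored in order."""
    offsets = []
    offset = header_size
    for name, vtype, shape in variables:
        offsets.append((name, offset))
        nbytes = TYPE_SIZE[vtype] * prod(dims[d] for d in shape)
        # Each variable's data is padded to a 4-byte boundary; every
        # size here is already a multiple of 4, so this changes nothing.
        offset += (nbytes + 3) // 4 * 4
    return offsets


for name, offset in data_offsets(VARIABLES, DIMS):
    flag = "   <-- offset >= 2**31" if offset >= LIMIT else ""
    print(f"{offset:12d}     {name}{flag}")
```

Because the real header has a nonzero size, the true offsets are
somewhat larger than these, which only makes the overflow worse.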

So this data can't all be stored in a classic format netCDF file, but
the good news is that it can probably be stored and accessed fine in a
64-bit offset format netCDF file, and it may be possible to fix the
headers of the CDF1 files to make them CDF2 files and copy the data
into the CDF2 files with no data loss.  
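
To tell which variant an existing file is, the first four bytes are
enough: a classic (CDF1) file begins with "CDF" followed by a byte of
value 1, and a 64-bit offset (CDF2) file with "CDF" followed by a byte
of value 2.  A minimal sketch, where the function name
netcdf_format_variant is made up for illustration:

```python
def netcdf_format_variant(path):
    """Classify a netCDF-3 file by its four-byte magic number."""
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic == b"CDF\x01":
        return "classic (CDF1)"
    if magic == b"CDF\x02":
        return "64-bit offset (CDF2)"
    return "unknown"
```

Changing the version byte alone does not convert a file, because the
per-variable begin offsets in the header are 32 bits wide in CDF1 and
64 bits wide in CDF2, so the header has to be rewritten as well.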

It's also possible that the 3.5.0 library wrote values for the last
five variables in the wrong place, in which case the data could not be
recovered.  With a little work, I think I could convert the file you
provided to a CDF2 file and dump out the first few values of each of
the last five variables.  If I did that, could you tell if they looked
right?  Do you have a lot of files in this form, so that it would be
worth trying to recover the data in this way?

--Russ

netcdf gfs00_dmos_emp.20050217.0040 {
dimensions:
        max_site_num = 2300 ;
        num_eqns = 30 ;
        var_regressors = 3 ;
        days = 16 ;
        fc_times_per_day = 4 ;
        daily_time = 1 ;
        weight_vals = 4 ;
variables:
        int type ;
                type:long_name = "cdl file type" ;
        double forc_time ;
                forc_time:long_name = "time of earliest forecast" ;
                forc_time:units = "seconds since 1970-1-1 00:00:00" ;
        double creation_time ;
                creation_time:long_name = "time at which forecast file was created" ;
                creation_time:units = "seconds since 1970-1-1 00:00:00" ;
        int num_sites ;
                num_sites:long_name = "number of actual_sites" ;
        int site_list(max_site_num) ;
                site_list:long_name = "forecast site list" ;
        float T(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                T:long_name = "temperature" ;
                T:units = "Celsius" ;
        float max_T(max_site_num, days, daily_time, num_eqns, var_regressors, weight_vals) ;
                max_T:long_name = "maximum temperature" ;
                max_T:units = "Celsius" ;
        float min_T(max_site_num, days, daily_time, num_eqns, var_regressors, weight_vals) ;
                min_T:long_name = "minimum temperature" ;
                min_T:units = "Celsius" ;
        float dewpt(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                dewpt:long_name = "dewpoint" ;
                dewpt:units = "Celsius" ;
        float wind_u(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                wind_u:long_name = "u-component of wind" ;
                wind_u:units = "meters per second" ;
        float wind_v(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                wind_v:long_name = "v-component of wind" ;
                wind_v:units = "meters per second" ;
        float wind_speed(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                wind_speed:long_name = "wind speed" ;
                wind_speed:units = "meters per second" ;
        float cloud_cov(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                cloud_cov:long_name = "cloud cover" ;
                cloud_cov:units = "percent*100" ;
        float visibility(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                visibility:long_name = "visibility" ;
                visibility:units = "km" ;
        float prob_fog(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                prob_fog:long_name = "probability of fog" ;
                prob_fog:units = "percent*100" ;
        float prob_thunder(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                prob_thunder:long_name = "probability of thunder" ;
                prob_thunder:units = "percent*100" ;
        float cprob_rain(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                cprob_rain:long_name = "conditional probability of rain" ;
                cprob_rain:units = "percent*100" ;
        float cprob_snow(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                cprob_snow:long_name = "conditional probability of snow" ;
                cprob_snow:units = "percent*100" ;
        float cprob_ice(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                cprob_ice:long_name = "conditional probability of ice" ;
                cprob_ice:units = "percent*100" ;
        float prob_precip06(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                prob_precip06:long_name = "probability of precipitation, 6 hr" ;
                prob_precip06:units = "percent*100" ;
        float prob_precip24(max_site_num, days, daily_time, num_eqns, var_regressors, weight_vals) ;
                prob_precip24:long_name = "probability of precipitation, 24 hr" ;
                prob_precip24:units = "percent*100" ;
        float qpf06(max_site_num, days, fc_times_per_day, num_eqns, var_regressors, weight_vals) ;
                qpf06:long_name = "amount of precipitation" ;
                qpf06:units = "mm" ;
data:
}