Re: [netcdfgroup] nf90_char size

First of all, thanks for all the advice.

On 02/05/20 19:01, Dave Allured - NOAA Affiliate wrote:
There it is.

> DATASET "BSE_RESONANT_COMPRESSED1_DONE" {
>       DATATYPE  H5T_STRING {
>          STRSIZE 1;
>          STRPAD H5T_STR_NULLTERM;
>          CSET H5T_CSET_UTF8;
>          CTYPE H5T_C_S1;

Your char arrays are being stored as strings, not 1-byte characters.  This incurs overhead for each character.

Ok, I see from the other email that this is not the issue: the dimension scales are using the extra space.
I'll try to generate the equivalent file from a much smaller simulation and let you know whether the same thing happens on a smaller dataset.

I understood from Wei-Keng's answer that this could be a bug in the version of netcdf I'm using and that I should upgrade to 4.7.4.
Is that right?

I am not familiar with the exact details of physical storage of HDF5 strings, but it doesn't matter. This scheme is inefficient, and you should find something better.

I vaguely recall some changes in netcdf-4 character storage in recent years.  Since you are using an older version of the netcdf library, first try the latest version.

Let me take the opportunity for a couple of questions.
The reason we have such a huge character variable is that we use it as a control variable. Basically we have simulations where a huge complex matrix is created and filled. Sometimes the code may crash during the simulation. The idea is that we write a character "t" for each complex number stored to disk. (We do not store the numbers one by one, but we cannot use fixed blocks either, since the way the matrix is blocked depends on the parallelization scheme.)

1) At first we thought it would be great if, after the interrupted run, nf90_get_var could tell us which values are filled and which are not.
Let's say my netcdf variable is
 1.23423, 4.3452, 5.3453, 7.34534, _, _, _, ...
i.e. only the first 4 values were computed and we need to restart from the 5th. We did not figure out a way to do this. So the first question is: is there a way to check that? There are nf90_def_var_fill and nf90_inq_var_fill, but I'm not sure I can use them for this.
Maybe we could use a control number instead, say first filling the matrix with zeros, but that means writing the whole matrix to disk twice.
Maybe we should go back to this idea ...
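Something like the following sketch is what we had in mind (made-up file/variable names, not our actual code, no error checking; it also assumes the variable was written with fill mode on, so that unwritten elements really contain the fill value):

    program find_restart_point
      ! Hypothetical sketch: "restart.nc" and "BIG_MATRIX_RE" are placeholders.
      use netcdf
      implicit none
      integer :: ncid, varid, no_fill, ierr, i, first_unwritten
      integer, parameter :: n = 1000000        ! a real code would read in chunks
      real :: fillv
      real, allocatable :: buf(:)

      ierr = nf90_open("restart.nc", nf90_nowrite, ncid)
      ierr = nf90_inq_varid(ncid, "BIG_MATRIX_RE", varid)
      ! ask the library which fill value the variable uses
      ! (the default for float is 9.96921e+36, as in the h5dump output below)
      ierr = nf90_inq_var_fill(ncid, varid, no_fill, fillv)

      allocate(buf(n))
      ierr = nf90_get_var(ncid, varid, buf, start=[1], count=[n])

      first_unwritten = n + 1
      do i = 1, n
         if (buf(i) == fillv) then             ! still the fill value => never written
            first_unwritten = i
            exit
         end if
      end do
      print *, "restart from element", first_unwritten
      ierr = nf90_close(ncid)
    end program find_restart_point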

2) The second thought was: let's use a logical.
We only need to store 1 bit of information next to each complex number (64 bits in single precision), which is not too much. But we found that netcdf does not have a 1-bit type, only 1-byte ones, so we ended up using nf90_char.
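If one byte per element is still too much, a bit-packing workaround (plain bit manipulation on top of nf90_byte, nothing netcdf-specific) could look roughly like the untested sketch below. Names are made up, error checking is omitted, and note that two tasks updating the same byte would race, so this only works if each task owns whole groups of 8 elements:

    program pack_done_flags
      ! Hypothetical sketch: 8 "done" flags packed into each byte.
      use netcdf
      implicit none
      integer :: ncid, dimid, varid, ierr
      integer :: elem, byte_idx, bit_pos
      integer(kind=1) :: packed                     ! 1-byte integer on common compilers

      ierr = nf90_create("done_bits.nc", nf90_netcdf4, ncid)
      ierr = nf90_def_dim(ncid, "BS_K_packed", (2025000000 + 7) / 8, dimid)
      ierr = nf90_def_var(ncid, "BSE_RESONANT_DONE_BITS", nf90_byte, [dimid], varid)
      ierr = nf90_def_var_fill(ncid, varid, 0, 0_1)  ! unwritten bytes read back as all-zero bits
      ierr = nf90_enddef(ncid)

      elem = 12345                                   ! element just flushed to disk
      byte_idx = (elem - 1) / 8 + 1
      bit_pos  = mod(elem - 1, 8)

      ierr = nf90_get_var(ncid, varid, packed, start=[byte_idx])  ! read-modify-write one byte
      packed = ibset(packed, bit_pos)
      ierr = nf90_put_var(ncid, varid, packed, start=[byte_idx])
      ! on restart, element 'elem' is done if btest(packed, bit_pos) is .true.

      ierr = nf90_close(ncid)
    end program pack_done_flags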


Otherwise, as I said elsewhere, go to 64-bit or CDF5 formats.
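(A minimal sketch of what that could look like in Fortran, with made-up names and no error checking; nf90_64bit_offset selects the 64-bit offset format, while nf90_64bit_data selects CDF5 and needs a reasonably recent netcdf, plus PnetCDF support for parallel writes.)

    program create_classic
      ! Hypothetical sketch: file name and sizes are placeholders.
      use netcdf
      implicit none
      integer :: ncid, dimid, varid, ierr

      ! CDF5; use nf90_64bit_offset instead for the 64-bit offset format
      ierr = nf90_create("ndb_done_cdf5.nc", ior(nf90_clobber, nf90_64bit_data), ncid)
      ierr = nf90_def_dim(ncid, "BS_K_linearized1", 2025000000, dimid)
      ! one plain byte-sized character per element, no HDF5 string overhead
      ierr = nf90_def_var(ncid, "BSE_RESONANT_COMPRESSED1_DONE", nf90_char, [dimid], varid)
      ierr = nf90_enddef(ncid)
      ierr = nf90_close(ncid)
    end program create_classic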

Here is another question. Our code started with netcdf.
Then we moved to parallel I/O, and the only way we found to do that was via HDF5.
1) Is there any alternative?
2) A silly question for a netcdf mailing list: for the way we use it, netcdf is essentially a layer on top of HDF5. Our developer team is discussing simply dropping netcdf and going straight to HDF5. Any reason why we shouldn't do that, in your opinion?
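For context, the kind of parallel creation call we mean is sketched below (made-up names, not our actual code, error checking omitted). It assumes a library built with parallel HDF5, as ours is; flag requirements differ a bit between library versions:

    program parallel_create
      ! Hypothetical sketch of parallel file creation through the netcdf Fortran API.
      use mpi
      use netcdf
      implicit none
      integer :: ierr, ncid, dimid, varid

      call mpi_init(ierr)
      ! netCDF-4/HDF5 parallel file; some versions want ior(nf90_netcdf4, nf90_mpiio),
      ! newer ones accept nf90_netcdf4 alone.  With a PnetCDF-enabled build,
      ! nf90_64bit_data here would go through PnetCDF instead of HDF5.
      ierr = nf90_create_par("done_flags.nc", nf90_netcdf4, MPI_COMM_WORLD, MPI_INFO_NULL, ncid)
      ierr = nf90_def_dim(ncid, "BS_K_linearized1", 2025000000, dimid)
      ierr = nf90_def_var(ncid, "DONE_FLAGS", nf90_byte, [dimid], varid)
      ierr = nf90_enddef(ncid)
      ierr = nf90_var_par_access(ncid, varid, nf90_collective)  ! or nf90_independent
      ! ... each rank writes its own portion with nf90_put_var(start=..., count=...) ...
      ierr = nf90_close(ncid)
      call mpi_finalize(ierr)
    end program parallel_create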


If netcdf-4 is important to you for some reason, you might also consider encoding your char data into signed or unsigned bytes.

Ok, I'll try using nf90_byte instead of nf90_char (is this what you suggest?).
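For concreteness, the swap we have in mind would look roughly like this (sketch only, made-up file name, no error checking):

    program char_to_byte
      ! Hypothetical sketch: same one flag per element, stored as an 8-bit integer
      ! instead of a 1-character HDF5 string.
      use netcdf
      implicit none
      integer :: ncid, dimid, varid, ierr, i
      integer(kind=1), parameter :: done = 1_1

      ierr = nf90_create("done_bytes.nc", nf90_netcdf4, ncid)
      ierr = nf90_def_dim(ncid, "BS_K_linearized1", 2025000000, dimid)
      ierr = nf90_def_var(ncid, "BSE_RESONANT_COMPRESSED1_DONE", nf90_byte, [dimid], varid)
      ierr = nf90_enddef(ncid)

      i = 1
      ierr = nf90_put_var(ncid, varid, done, start=[i])  ! mark element i as written (1 instead of "t")
      ierr = nf90_close(ncid)
    end program char_to_byte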




On Sat, May 2, 2020 at 10:38 AM Davide Sangalli <davide.sangalli@xxxxxx> wrote:

    h5stat -Ss ndb.BS_COMPRESS0.005000_Q1

    Filename: ndb.BS_COMPRESS0.005000_Q1
    Free-space section threshold: 1 bytes
    Small size free-space sections (< 10 bytes):
    Total # of small size sections: 0
    Free-space section bins:
    Total # of sections: 0
    File space management strategy: H5F_FILE_SPACE_ALL
    Summary of file space information:
    File metadata: 4355 bytes
    Raw data: 16356758312 bytes
    Amount/Percent of tracked free space: 0 bytes/0.0%
    Unaccounted space: 6216 bytes
    Total space: 16356768883 bytes



    On Sat, May 2, 2020 at 6:28 PM +0200, "Davide Sangalli"
    <davide.sangalli@xxxxxx> wrote:

        h5dump -Hp ndb.BS_COMPRESS0.005000_Q1
        HDF5 "ndb.BS_COMPRESS0.005000_Q1" {
        GROUP "/" {
        ATTRIBUTE "_NCProperties" {
        DATATYPE  H5T_STRING {
        STRSIZE 57;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        DATASET "BSE_RESONANT_COMPRESSED1" {
        DATATYPE  H5T_IEEE_F32LE
        DATASPACE  SIMPLE { ( 24776792, 2 ) / ( 24776792, 2 ) }
        STORAGE_LAYOUT {
        CONTIGUOUS
        SIZE 198214336
        OFFSET 16158554547
        }
        FILTERS {
        NONE
        }
        FILLVALUE {
        FILL_TIME H5D_FILL_TIME_IFSET
        VALUE  9.96921e+36
        }
        ALLOCATION_TIME {
        H5D_ALLOC_TIME_EARLY
        }
        ATTRIBUTE "DIMENSION_LIST" {
        DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
        DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
        }
        }
        DATASET "BSE_RESONANT_COMPRESSED1_DONE" {
        DATATYPE  H5T_STRING {
        STRSIZE 1;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_UTF8;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
        STORAGE_LAYOUT {
        CONTIGUOUS
        SIZE 2025000000
        OFFSET 8100002379
        }
        FILTERS {
        NONE
        }
        FILLVALUE {
        FILL_TIME H5D_FILL_TIME_IFSET
        VALUE  ""
        }
        ALLOCATION_TIME {
        H5D_ALLOC_TIME_EARLY
        }
        ATTRIBUTE "DIMENSION_LIST" {
        DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
        DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
        }
        }
        DATASET "BSE_RESONANT_COMPRESSED2_DONE" {
        DATATYPE  H5T_STRING {
        STRSIZE 1;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_UTF8;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
        STORAGE_LAYOUT {
        CONTIGUOUS
        SIZE 2025000000
        OFFSET 10125006475
        }
        FILTERS {
        NONE
        }
        FILLVALUE {
        FILL_TIME H5D_FILL_TIME_IFSET
        VALUE  ""
        }
        ALLOCATION_TIME {
        H5D_ALLOC_TIME_EARLY
        }
        ATTRIBUTE "DIMENSION_LIST" {
        DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
        DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
        }
        }
        DATASET "BSE_RESONANT_COMPRESSED3_DONE" {
        DATATYPE  H5T_STRING {
        STRSIZE 1;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_UTF8;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SIMPLE { ( 781887360 ) / ( 781887360 ) }
        STORAGE_LAYOUT {
        CONTIGUOUS
        SIZE 781887360
        OFFSET 15277557963
        }
        FILTERS {
        NONE
        }
        FILLVALUE {
        FILL_TIME H5D_FILL_TIME_IFSET
        VALUE  ""
        }
        ALLOCATION_TIME {
        H5D_ALLOC_TIME_EARLY
        }
        ATTRIBUTE "DIMENSION_LIST" {
        DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
        DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
        }
        }
        DATASET "BS_K_compressed1" {
        DATATYPE  H5T_IEEE_F32BE
        DATASPACE  SIMPLE { ( 24776792 ) / ( 24776792 ) }
        STORAGE_LAYOUT {
        CONTIGUOUS
        SIZE 99107168
        OFFSET 16059447379
        }
        FILTERS {
        NONE
        }
        FILLVALUE {
        FILL_TIME H5D_FILL_TIME_IFSET
        VALUE  H5D_FILL_VALUE_DEFAULT
        }
        ALLOCATION_TIME {
        H5D_ALLOC_TIME_EARLY
        }
        ATTRIBUTE "CLASS" {
        DATATYPE  H5T_STRING {
        STRSIZE 16;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        ATTRIBUTE "NAME" {
        DATATYPE  H5T_STRING {
        STRSIZE 64;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        ATTRIBUTE "REFERENCE_LIST" {
        DATATYPE  H5T_COMPOUND {
        H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
        H5T_STD_I32LE "dimension";
        }
        DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
        }
        }
        DATASET "BS_K_linearized1" {
        DATATYPE  H5T_IEEE_F32BE
        DATASPACE  SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
        STORAGE_LAYOUT {
        CONTIGUOUS
        SIZE 8100000000
        OFFSET 2379
        }
        FILTERS {
        NONE
        }
        FILLVALUE {
        FILL_TIME H5D_FILL_TIME_IFSET
        VALUE  H5D_FILL_VALUE_DEFAULT
        }
        ALLOCATION_TIME {
        H5D_ALLOC_TIME_EARLY
        }
        ATTRIBUTE "CLASS" {
        DATATYPE  H5T_STRING {
        STRSIZE 16;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        ATTRIBUTE "NAME" {
        DATATYPE  H5T_STRING {
        STRSIZE 64;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        ATTRIBUTE "REFERENCE_LIST" {
        DATATYPE  H5T_COMPOUND {
        H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
        H5T_STD_I32LE "dimension";
        }
        DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
        }
        }
        DATASET "BS_K_linearized2" {
        DATATYPE  H5T_IEEE_F32BE
        DATASPACE  SIMPLE { ( 781887360 ) / ( 781887360 ) }
        STORAGE_LAYOUT {
        CONTIGUOUS
        SIZE 3127549440
        OFFSET 12150006475
        }
        FILTERS {
        NONE
        }
        FILLVALUE {
        FILL_TIME H5D_FILL_TIME_IFSET
        VALUE  H5D_FILL_VALUE_DEFAULT
        }
        ALLOCATION_TIME {
        H5D_ALLOC_TIME_EARLY
        }
        ATTRIBUTE "CLASS" {
        DATATYPE  H5T_STRING {
        STRSIZE 16;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        ATTRIBUTE "NAME" {
        DATATYPE  H5T_STRING {
        STRSIZE 64;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        ATTRIBUTE "REFERENCE_LIST" {
        DATATYPE  H5T_COMPOUND {
        H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
        H5T_STD_I32LE "dimension";
        }
        DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
        }
        }
        DATASET "complex" {
        DATATYPE  H5T_IEEE_F32BE
        DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
        STORAGE_LAYOUT {
        CONTIGUOUS
        SIZE 8
        OFFSET 16059447371
        }
        FILTERS {
        NONE
        }
        FILLVALUE {
        FILL_TIME H5D_FILL_TIME_IFSET
        VALUE  H5D_FILL_VALUE_DEFAULT
        }
        ALLOCATION_TIME {
        H5D_ALLOC_TIME_EARLY
        }
        ATTRIBUTE "CLASS" {
        DATATYPE  H5T_STRING {
        STRSIZE 16;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        ATTRIBUTE "NAME" {
        DATATYPE  H5T_STRING {
        STRSIZE 64;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
        }
        DATASPACE  SCALAR
        }
        ATTRIBUTE "REFERENCE_LIST" {
        DATATYPE  H5T_COMPOUND {
        H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
        H5T_STD_I32LE "dimension";
        }
        DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
        }
        }
        }
        }



        On Sat, May 2, 2020 at 5:55 PM +0200, "Wei-Keng Liao"
        <wkliao@xxxxxxxxxxxxxxxx> wrote:

            For HDF5 files, command “h5dump -Hp ndb.BS_COMPRESS0.005000_Q1” shows
            the data chunk settings used by all datasets in the file.

            Command “h5stat -Ss ndb.BS_COMPRESS0.005000_Q1” shows information about
            free space, metadata, raw data, etc.

            They may reveal why your file is abnormally big.
            Most likely it is the chunk settings you used.

            Wei-keng

            > On May 1, 2020, at 6:40 PM, Davide Sangalli wrote:
            >
            > I also add
            >
            > ncvalidator ndb.BS_COMPRESS0.005000_Q1
            > Error: Unknow file signature
            >     Expecting "CDF1", "CDF2", or "CDF5", but got "�HDF"
            > File "ndb.BS_COMPRESS0.005000_Q1" fails to conform with CDF file format specifications
            >
            > Best,
            > D.
            >
            > On 02/05/20 01:26, Davide Sangalli wrote:
            >> Output of ncdump -hs
            >>
            >> D.
            >>
            >> ncdump -hs BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_COMPRESS0.005000_Q1
            >>
            >> netcdf ndb.BS_COMPRESS0 {
            >> dimensions:
            >>         BS_K_linearized1 = 2025000000 ;
            >>         BS_K_linearized2 = 781887360 ;
            >>         complex = 2 ;
            >>         BS_K_compressed1 = 24776792 ;
            >> variables:
            >>         char BSE_RESONANT_COMPRESSED1_DONE(BS_K_linearized1) ;
            >>                 BSE_RESONANT_COMPRESSED1_DONE:_Storage = "contiguous" ;
            >>         char BSE_RESONANT_COMPRESSED2_DONE(BS_K_linearized1) ;
            >>                 BSE_RESONANT_COMPRESSED2_DONE:_Storage = "contiguous" ;
            >>         char BSE_RESONANT_COMPRESSED3_DONE(BS_K_linearized2) ;
            >>                 BSE_RESONANT_COMPRESSED3_DONE:_Storage = "contiguous" ;
            >>         float BSE_RESONANT_COMPRESSED1(BS_K_compressed1, complex) ;
            >>                 BSE_RESONANT_COMPRESSED1:_Storage = "contiguous" ;
            >>                 BSE_RESONANT_COMPRESSED1:_Endianness = "little" ;
            >> // global attributes:
            >>                 :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
            >>                 :_SuperblockVersion = 0 ;
            >>                 :_IsNetcdf4 = 1 ;
            >>                 :_Format = "netCDF-4" ;
            >>
            >>
            >> On Sat, May 2, 2020 at 12:24 AM +0200, "Dave Allured - NOAA Affiliate" wrote:
            >>
            >> I agree that you should expect the file size to be about 1 byte per stored character. IMO the most likely explanation is that you have a netcdf-4 file with inappropriately small chunk size. Another possibility is a 64-bit offset file with crazy huge padding between file sections. This is very unlikely, but I do not know what is inside your writer code.
            >>
            >> Diagnose, please. Ncdump -hs. If it is 64-bit offset, I think ncvalidator can display the hidden pad sizes.
            >>
            >>
            >> On Fri, May 1, 2020 at 3:37 PM Davide Sangalli wrote:
            >> Dear all,
            >> I'm a developer of a fortran code which uses netcdf for I/O
            >>
            >> In one of my runs I created a file with some huge array of characters.
            >> The header of the file is the following:
            >> netcdf ndb.BS_COMPRESS0 {
            >> dimensions:
            >>     BS_K_linearized1 = 2025000000 ;
            >>     BS_K_linearized2 = 781887360 ;
            >> variables:
            >>     char BSE_RESONANT_COMPRESSED1_DONE(BS_K_linearized1) ;
            >>     char BSE_RESONANT_COMPRESSED2_DONE(BS_K_linearized1) ;
            >>     char BSE_RESONANT_COMPRESSED3_DONE(BS_K_linearized2) ;
            >> }
            >>
            >> The variable is declared as nf90_char which, according to the documentation, should be 1 byte per element.
            >> Thus I would expect the total size of the file to be 1 byte*(2*2025000000+781887360) ~ 4.5 GB
            >> Instead the file size is 16059445323 bytes ~ 14.96 GB, i.e. 10.46 GB more and a factor 3.33 bigger
            >>
            >> This happens consistently if I consider the file
            >> netcdf ndb {
            >> dimensions:
            >>     complex = 2 ;
            >>     BS_K_linearized1 = 2025000000 ;
            >>     BS_K_linearized2 = 781887360 ;
            >> variables:
            >>     float BSE_RESONANT_LINEARIZED1(BS_K_linearized1, complex) ;
            >>     char BSE_RESONANT_LINEARIZED1_DONE(BS_K_linearized1) ;
            >>     float BSE_RESONANT_LINEARIZED2(BS_K_linearized1, complex) ;
            >>     char BSE_RESONANT_LINEARIZED2_DONE(BS_K_linearized1) ;
            >>     float BSE_RESONANT_LINEARIZED3(BS_K_linearized2, complex) ;
            >>     char BSE_RESONANT_LINEARIZED3_DONE(BS_K_linearized2) ;
            >> }
            >> The float component should weigh ~36 GB while the char component should be identical to before, i.e. 4.5 GB, for a total of 40.5 GB
            >> The file is instead ~ 50.96 GB, i.e. again 10.46 GB bigger than expected.
            >>
            >> Why ?
            >>
            >> My character variables are something like
            >> "tnnnntnnnntnnnnnnnntnnnnnttnnnnnnnnnnnnnnnnt..."
            >> but the file size is already like that just after the file creation, i.e. before filling it.
            >>
            >> A few details about the library, compiled linking to HDF5 (hdf5-1.8.18), with parallel IO support:
            >> Name: netcdf
            >> Description: NetCDF Client Library for C
            >> URL: http://www.unidata.ucar.edu/netcdf
            >> Version: 4.4.1.1
            >> Libs: -L${libdir} -lnetcdf -ldl -lm /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5hl_fortran.a /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5_fortran.a /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5_hl.a /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5.a -lz -lm -ldl -lcurl
            >> Cflags: -I${includedir}
            >>
            >> Name: netcdf-fortran
            >> Description: NetCDF Client Library for Fortran
            >> URL: http://www.unidata.ucar.edu/netcdf
            >> Version: 4.4.4
            >> Requires.private: netcdf > 4.1.1
            >> Libs: -L${libdir} -lnetcdff
            >> Libs.private: -L${libdir} -lnetcdff -lnetcdf
            >> Cflags: -I${includedir}
            >>
            >> Best,
            >> D.
            >> --
            >> Davide Sangalli, PhD
            >> CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
            >> Area della Ricerca di Roma 1, 00016 Monterotondo Scalo, Italy
            >> http://www.ism.cnr.it/en/davide-sangalli-cv/
            >> http://www.max-centre.eu/

