Re: overcoming netcdf3 limits

John Caron wrote:
> Hi Greg:
>
> Can you send me the output from ncdump -h on a typical file ?
Attached is the output from one of our files.  The file it comes from is
approximately 407 MB. Files can range from a few hundred bytes up to
hundreds of GB.
>
> How does the parallel part work? do certain sections of the file get
> reserved for different nodes, or is there a master node that
> essentially serializes the writing?
During mesh generation, we generate the model in anywhere from 1 to M
separate pieces and then join these pieces into a single mesh model
prior to the analysis.  We then have another tool which will determine
how to decompose the model across the number of processors being used
for the analysis run.  It then creates N subpieces of the model which
are then read by the analysis code. The analysis code then runs and each
processor writes its portion of the model to its own results file. After
the job is finished, we can combine those pieces into a single piece
for visualization.  We also have the option to join them into larger
subpieces (for example, 4,000 individual pieces into 10 groups), and some
of the visualizers can read the individual pieces without joining.  An
old description of what
we did back in 1999 is available at
http://endo.sandia.gov/SEACAS/Documentation/Parallel_Instruction.pdf. 
The basic operation is the same today, but the sizes of models have
increased greatly and some of the processes and codes have changed.

--Greg
>
> Greg Sjaardema wrote:
>> As a quick answer to the question, we (Sandia Labs) use netcdf
>> underneath our exodusII
>> file format for storing finite element results data.
>>
>> If the mesh contains #nodes nodes and #elements elements, then there
>> will be a dataset of the size #elements*8*4 (assuming a hex element with
>> 8 nodes, 4 bytes/int) to store the nodal connectivity of each hex
>> element in a group of elements (element block). Assuming a 4 GiB limit
>> on variable size, this restricts us to ~134 million elements per element
>> block (using CDF-2), which is large, but not enough to give us more than
>> a few months of breathing room.  Using the CDF-1 format, we top out at
>> about 30 million elements or less, a limit we hit routinely.
>>
>> There is a pdf file at
>> http://endo.sandia.gov/SEACAS/Documentation/exodusII.pdf that shows
>> (starting at page 177) how we map exodusII onto netcdf.  There have been
>> some changes since the report was written to reduce some of the dataset
>> sizes.  For example, we split the "coord" dataset into 3 separate
>> datasets now and we also split the vals_nod_var into a single dataset
>> per nodal variable.
>>
>> --Greg
>>
>>
>> John Caron wrote:
>>> Hi Rob:
>>>
>>> Could you give use case(s) where the limits are being hit?
>>> I'd be interested in actual dimension sizes, number of variables,
>>> whether you are using a record dimension, etc.
>>>
>>> Robert Latham wrote:
>>>> Hi
>>>>
>>>> Over in Parallel-NetCDF land we're running into users who find even
>>>> the CDF-2 file format limitations, well, limiting.
>>>> http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/NetCDF-64-bit-Offset-Format-Limitations.html
>>>>
>>>> http://www.unidata.ucar.edu/software/netcdf/docs/faq.html#Large%20File%20Support10
>>>>
>>>> If we worked up a CDF-3 file format for parallel-netcdf (off the top
>>>> of my head, maybe a 64 bit integer instead of an unsigned 32 bit
>>>> integer could be used to describe variables), would the serial netcdf
>>>> folks be interested, or are you looking to the new netcdf-4 format to
>>>> take care of these limits?
>>>>
>>>> Thanks
>>>> ==rob
>>>>
>
netcdf 2000_Hz_plate {
dimensions:
        len_string = 33 ;
        len_line = 81 ;
        four = 4 ;
        time_step = UNLIMITED ; // (81 currently)
        num_info = 219 ;
        num_qa_rec = 2 ;
        num_dim = 3 ;
        num_nodes = 39316 ;
        num_elem = 33208 ;
        num_el_blk = 6 ;
        num_side_sets = 3 ;
        num_el_in_blk1 = 5492 ;
        num_nod_per_el1 = 8 ;
        num_el_in_blk2 = 3498 ;
        num_nod_per_el2 = 8 ;
        num_el_in_blk3 = 23858 ;
        num_nod_per_el3 = 8 ;
        num_el_in_blk4 = 216 ;
        num_nod_per_el4 = 8 ;
        num_el_in_blk5 = 72 ;
        num_nod_per_el5 = 8 ;
        num_el_in_blk6 = 72 ;
        num_nod_per_el6 = 8 ;
        num_side_ss1 = 66 ;
        num_side_ss2 = 832 ;
        num_side_ss3 = 614 ;
        num_glo_var = 4 ;
        num_nod_var = 9 ;
        num_elem_var = 9 ;
variables:
        double time_whole(time_step) ;
        char info_records(num_info, len_line) ;
        char qa_records(num_qa_rec, four, len_string) ;
        int eb_status(num_el_blk) ;
        int eb_prop1(num_el_blk) ;
                eb_prop1:name = "ID" ;
        int ss_status(num_side_sets) ;
        int ss_prop1(num_side_sets) ;
                ss_prop1:name = "ID" ;
        double coord(num_dim, num_nodes) ;
        char coor_names(num_dim, len_string) ;
        int connect1(num_el_in_blk1, num_nod_per_el1) ;
                connect1:elem_type = "hex8" ;
        int connect2(num_el_in_blk2, num_nod_per_el2) ;
                connect2:elem_type = "hex8" ;
        int connect3(num_el_in_blk3, num_nod_per_el3) ;
                connect3:elem_type = "hex8" ;
        int connect4(num_el_in_blk4, num_nod_per_el4) ;
                connect4:elem_type = "hex8" ;
        int connect5(num_el_in_blk5, num_nod_per_el5) ;
                connect5:elem_type = "hex8" ;
        int connect6(num_el_in_blk6, num_nod_per_el6) ;
                connect6:elem_type = "hex8" ;
        int elem_num_map(num_elem) ;
        int node_num_map(num_nodes) ;
        int elem_ss1(num_side_ss1) ;
        int side_ss1(num_side_ss1) ;
        int elem_ss2(num_side_ss2) ;
        int side_ss2(num_side_ss2) ;
        int elem_ss3(num_side_ss3) ;
        int side_ss3(num_side_ss3) ;
        double vals_glo_var(time_step, num_glo_var) ;
        char name_glo_var(num_glo_var, len_string) ;
        double vals_nod_var(time_step, num_nod_var, num_nodes) ;
        char name_nod_var(num_nod_var, len_string) ;
        char name_elem_var(num_elem_var, len_string) ;
        double vals_elem_var1eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var2eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var3eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var4eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var5eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var6eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var7eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var8eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var9eb1(time_step, num_el_in_blk1) ;
        double vals_elem_var2eb2(time_step, num_el_in_blk2) ;
        double vals_elem_var3eb2(time_step, num_el_in_blk2) ;
        double vals_elem_var4eb2(time_step, num_el_in_blk2) ;
        double vals_elem_var5eb2(time_step, num_el_in_blk2) ;
        double vals_elem_var6eb2(time_step, num_el_in_blk2) ;
        double vals_elem_var7eb2(time_step, num_el_in_blk2) ;
        double vals_elem_var8eb2(time_step, num_el_in_blk2) ;
        double vals_elem_var9eb2(time_step, num_el_in_blk2) ;
        double vals_elem_var2eb3(time_step, num_el_in_blk3) ;
        double vals_elem_var3eb3(time_step, num_el_in_blk3) ;
        double vals_elem_var4eb3(time_step, num_el_in_blk3) ;
        double vals_elem_var5eb3(time_step, num_el_in_blk3) ;
        double vals_elem_var6eb3(time_step, num_el_in_blk3) ;
        double vals_elem_var7eb3(time_step, num_el_in_blk3) ;
        double vals_elem_var8eb3(time_step, num_el_in_blk3) ;
        double vals_elem_var9eb3(time_step, num_el_in_blk3) ;
        double vals_elem_var2eb4(time_step, num_el_in_blk4) ;
        double vals_elem_var3eb4(time_step, num_el_in_blk4) ;
        double vals_elem_var4eb4(time_step, num_el_in_blk4) ;
        double vals_elem_var5eb4(time_step, num_el_in_blk4) ;
        double vals_elem_var6eb4(time_step, num_el_in_blk4) ;
        double vals_elem_var7eb4(time_step, num_el_in_blk4) ;
        double vals_elem_var8eb4(time_step, num_el_in_blk4) ;
        double vals_elem_var9eb4(time_step, num_el_in_blk4) ;
        double vals_elem_var2eb5(time_step, num_el_in_blk5) ;
        double vals_elem_var3eb5(time_step, num_el_in_blk5) ;
        double vals_elem_var4eb5(time_step, num_el_in_blk5) ;
        double vals_elem_var5eb5(time_step, num_el_in_blk5) ;
        double vals_elem_var6eb5(time_step, num_el_in_blk5) ;
        double vals_elem_var7eb5(time_step, num_el_in_blk5) ;
        double vals_elem_var8eb5(time_step, num_el_in_blk5) ;
        double vals_elem_var9eb5(time_step, num_el_in_blk5) ;
        double vals_elem_var2eb6(time_step, num_el_in_blk6) ;
        double vals_elem_var3eb6(time_step, num_el_in_blk6) ;
        double vals_elem_var4eb6(time_step, num_el_in_blk6) ;
        double vals_elem_var5eb6(time_step, num_el_in_blk6) ;
        double vals_elem_var6eb6(time_step, num_el_in_blk6) ;
        double vals_elem_var7eb6(time_step, num_el_in_blk6) ;
        double vals_elem_var8eb6(time_step, num_el_in_blk6) ;
        double vals_elem_var9eb6(time_step, num_el_in_blk6) ;
        int elem_var_tab(num_el_blk, num_elem_var) ;

// global attributes:
                :api_version = 3.25f ;
                :version = 2.05f ;
                :floating_point_word_size = 8 ;
                :title = "Default Sierra Title" ;
}
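
As a rough illustration of how the dimensions in this header relate to the
size limits discussed in the thread, the sketch below redoes the arithmetic
in Python. The function and constant names are hypothetical, and the 4 GiB
per-variable limit is simply the figure assumed in the thread for CDF-2; it
is not read from the file. Hex8 connectivity is stored as 4-byte ints and
results as 8-byte doubles, as shown in the header.

```python
# Back-of-the-envelope sizes for the header above.
# ASSUMPTION: 4 GiB per-variable limit, as quoted in the thread for CDF-2.
INT_BYTES = 4                   # "int" in the ncdump output
DOUBLE_BYTES = 8                # floating_point_word_size = 8
NODES_PER_HEX = 8               # num_nod_per_elN = 8 for every block here
CDF2_VAR_LIMIT = 4 * 1024**3    # 4 GiB


def connect_bytes(num_elem, nodes_per_elem=NODES_PER_HEX):
    """Bytes needed by one connectN(num_el_in_blkN, num_nod_per_elN) variable."""
    return num_elem * nodes_per_elem * INT_BYTES


def max_elements_per_block(limit_bytes=CDF2_VAR_LIMIT,
                           nodes_per_elem=NODES_PER_HEX):
    """Largest element block whose connectivity fits under limit_bytes."""
    return limit_bytes // (nodes_per_elem * INT_BYTES)


def nodal_record_bytes(num_nod_var, num_nodes):
    """Bytes written per time step for vals_nod_var(time_step, num_nod_var, num_nodes)."""
    return num_nod_var * num_nodes * DOUBLE_BYTES


if __name__ == "__main__":
    # connect3 is the largest block in this file (num_el_in_blk3 = 23858).
    print(connect_bytes(23858))            # 763456 bytes
    # Matches the "~134 million elements per element block" estimate.
    print(max_elements_per_block())        # 134217728
    # One time step of vals_nod_var (num_nod_var = 9, num_nodes = 39316).
    print(nodal_record_bytes(9, 39316))    # 2830752 bytes
```

For this 407 MB file every variable is far below the limit; the problem only
appears once a single element block approaches the hundred-million-element
range.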