Re: [netcdf-hdf] a question about HDF5 and large file - why so long to write one value?

  • To: Ed Hartnett <ed@xxxxxxxxxxxxxxxx>
  • Subject: Re: [netcdf-hdf] a question about HDF5 and large file - why so long to write one value?
  • From: Quincey Koziol <koziol@xxxxxxxxxxxx>
  • Date: Mon, 20 Aug 2007 15:39:03 -0500
Hi Ed,

On Aug 18, 2007, at 7:52 AM, Ed Hartnett wrote:

Howdy all!

I am writing a test program which writes large files (well over 2
GB). I have some questions about HDF5 and very large files. I need to
check out whether netCDF-4 has been correctly implemented for best
performance.

In the program below, I create 4 datasets, of type double. They are
one-dimensional, with length 2147483644/4. (That is 17179869152 bytes
of data.)

Then I write the last value only in each dataset.

Took a really long time - minutes. Is this expected? What is HDF5
doing in the background here? Is there something I can do with
chunking here to improve the speed of this program?

I am not setting a fill value, so what is being written here? I
naively expected that HDF5 would not write all the data I am skipping,
but would find a way to write data only around the value that I am
actually writing...

The file that this program creates is 17179883735 bytes, which is
14583 bytes of HDF5 overhead. Is that about what is expected?
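
The figures above can be checked with a few lines of plain C. This is only a sketch of the arithmetic; the function names are made up for illustration and are not part of netCDF or HDF5.

```c
#include <stdint.h>

/* Raw data bytes: elements per variable * bytes per element * variables.
   (Hypothetical helper, for illustration only.) */
uint64_t payload_bytes(uint64_t dim_len, uint64_t elem_size, uint64_t nvars)
{
    return dim_len * elem_size * nvars;
}

/* Whatever is left in the file after the raw data is format overhead.
   Assumes file_bytes >= payload. */
uint64_t format_overhead_bytes(uint64_t file_bytes, uint64_t payload)
{
    return file_bytes - payload;
}
```

With a dimension length of 2147483644/4 = 536870911, 8-byte doubles and 4 variables, payload_bytes() gives 17179869152, and subtracting that from the 17179883735-byte file leaves 14583 bytes.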

Any comments welcome...

The problem is in your computation of the chunk size for the dataset, in libsrc4/nc4hdf.c, around lines 1059-1084. The current computations end up with a chunk of size equal to the dimension size (2147483644/4 in the code below), i.e. a single 4GB chunk for the entire dataset. This is not going to work well, since HDF5 always reads an entire chunk into memory, updates it and then writes the entire chunk back out to disk. ;-)

That section of code looks like it has the beginning of some heuristics for automatically tuning the chunk size, but it would probably be better to let the application set a particular chunk size, if possible.
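
To put numbers on this, here is a minimal sketch in plain C (no HDF5 calls) of the arithmetic behind the chunk-size problem. It assumes, as described above, that updating one value means reading the whole chunk and writing the whole chunk back. The function names and the 65536-element chunk length used in the note below are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Bytes in one chunk of chunk_len elements, each elem_size bytes. */
uint64_t chunk_bytes(uint64_t chunk_len, uint64_t elem_size)
{
    return chunk_len * elem_size;
}

/* Bytes moved to update a single element, ASSUMING the whole chunk is
   read into memory and then written back (one read + one write). */
uint64_t single_update_io_bytes(uint64_t chunk_len, uint64_t elem_size)
{
    return 2 * chunk_bytes(chunk_len, elem_size);
}

/* Chunks needed to cover dim_len elements; the last one may be partial. */
uint64_t num_chunks(uint64_t dim_len, uint64_t chunk_len)
{
    return (dim_len + chunk_len - 1) / chunk_len;
}
```

With one chunk spanning the whole dimension (536870911 doubles), chunk_bytes() is 4294967288, so a single-value update moves about 8 GiB per dataset under that assumption. With a hypothetical 65536-element chunk the same update moves 1 MiB, and the dimension is covered by 8192 chunks.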

        Quincey


Thanks,

Ed

/*
 Copyright 2007, UCAR/Unidata
 See COPYRIGHT file for copying and redistribution conditions.

 This program (quickly, but not thoroughly) tests the large file
 features of netCDF-4.

 $Id: tst_large.c,v 1.3 2007/08/18 12:26:38 ed Exp $
*/
#include <config.h>
#include <nc_tests.h>
#include <netcdf.h>
#include <stdio.h>
#include <string.h>

/* This is the magic number for classic format limits: 2 GiB - 4
   bytes. */
#define MAX_CLASSIC_BYTES 2147483644

/* This is the magic number for 64-bit offset format limits: 4 GiB - 4
   bytes. */
#define MAX_64OFFSET_BYTES 4294967292

/* Handy for constructing tests. */
#define QTR_CLASSIC_MAX (MAX_CLASSIC_BYTES/4)

/* We will create this file. */
#define FILE_NAME "tst_large.nc"

int
main(int argc, char **argv)
{

    printf("\n*** Testing really large files in netCDF-4/HDF5 format, quickly.\n");

    printf("\n*** Testing create of simple, but large, file...");
    {
#define DIM_NAME "Time_in_nanoseconds"
#define NUMDIMS 1
#define NUMVARS 4

       int ncid, dimids[NUMDIMS], varid[NUMVARS];
       char var_name[NUMVARS][NC_MAX_NAME + 1] = {"England", "Scotland", "Ireland", "Wales"};
       size_t index[2] = {QTR_CLASSIC_MAX-1, 0};
       int ndims, nvars, natts, unlimdimid;
       nc_type xtype;
       char name_in[NC_MAX_NAME + 1];
       size_t len;
       double pi = 3.1459, pi_in;
       int i;

       /* Create a netCDF-4/HDF5 format file, with 4 vars. */
       if (nc_create(FILE_NAME, NC_NETCDF4, &ncid)) ERR;
       if (nc_set_fill(ncid, NC_NOFILL, NULL)) ERR;
       if (nc_def_dim(ncid, DIM_NAME, QTR_CLASSIC_MAX, dimids)) ERR;
       for (i = 0; i < NUMVARS; i++)
       {
          if (nc_def_var(ncid, var_name[i], NC_DOUBLE, NUMDIMS,
                         dimids, &varid[i])) ERR;
       }
       if (nc_enddef(ncid)) ERR;
       for (i = 0; i < NUMVARS; i++)
          if (nc_put_var1_double(ncid, i, index, &pi)) ERR;
       if (nc_close(ncid)) ERR;

       /* Reopen and check the file. */
       if (nc_open(FILE_NAME, 0, &ncid)) ERR;
       if (nc_inq(ncid, &ndims, &nvars, &natts, &unlimdimid)) ERR;
       if (ndims != NUMDIMS || nvars != NUMVARS || natts != 0 || unlimdimid != -1) ERR;
       if (nc_inq_dimids(ncid, &ndims, dimids, 1)) ERR;
       if (ndims != 1 || dimids[0] != 0) ERR;
       if (nc_inq_dim(ncid, 0, name_in, &len)) ERR;
       if (strcmp(name_in, DIM_NAME) || len != QTR_CLASSIC_MAX) ERR;
       for (i = 0; i < NUMVARS; i++)
       {
          if (nc_inq_var(ncid, i, name_in, &xtype, &ndims, dimids, &natts)) ERR;
          if (strcmp(name_in, var_name[i]) || xtype != NC_DOUBLE || ndims != 1 ||
              dimids[0] != 0 || natts != 0) ERR;
          if (nc_get_var1_double(ncid, i, index, &pi_in)) ERR;
          if (pi_in != pi) ERR;
       }
       if (nc_close(ncid)) ERR;
    }

    SUMMARIZE_ERR;
    FINAL_RESULTS;
}


--
Ed Hartnett  -- ed@xxxxxxxxxxxxxxxx

_______________________________________________
netcdf-hdf mailing list
netcdf-hdf@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit: http://www.unidata.ucar.edu/mailing_lists/

