Re: many little files versus one big file



> Keywords: 199404152043.AA21089

Charles,

> > Since the files are so small, could you send me the output of "ncdump" on
> > two or three of them?  That might be enough information for me to construct
> > a small test case.
> 
> Here are 10 of them.

Thanks.  I ran "unshar" on the files you sent, then created three files from them:

    i1.nc       with only 5 records, just "ncgen -b" on one of your files
    i10.nc      with 50 records, by cutting and pasting in i1.cdl and using
                "ncgen -b" on the result.
    i100.nc     with 500 records, by doing the same with i10.nc.

The dimensions, variables, and structure are identical to those in your
files.  There are just more values for the four record variables

    frame_labels
    cum_occurrences
    gsp_mean
    lpc_mean

in the latter two files.  Here are the sizes of the resulting netCDF files:

    -rw-rw-r--   1 russ     2848 Apr 26 15:06 i1.nc
    -rw-rw-r--   1 russ    14368 Apr 26 15:06 i10.nc
    -rw-rw-r--   1 russ   129568 Apr 26 15:14 i100.nc
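
As a quick sanity check on those sizes, here is a minimal Python sketch that
computes the bytes added per record between successive files.  The record
counts and byte counts come from the listing above; the names
"bytes_per_record" and "fixed_bytes" are just illustrative, and treating the
remainder as a fixed per-file cost is an assumption of a simple linear model,
not something read from the files themselves.

```python
# (number of records, file size in bytes), taken from the listing above
sizes = [(5, 2848), (50, 14368), (500, 129568)]


def bytes_per_record(a, b):
    """Slope of file size versus record count between two files."""
    (n0, s0), (n1, s1) = a, b
    return (s1 - s0) / (n1 - n0)


slope_small = bytes_per_record(sizes[0], sizes[1])  # i1.nc  -> i10.nc
slope_large = bytes_per_record(sizes[1], sizes[2])  # i10.nc -> i100.nc

# Under a linear model, whatever is left over is a fixed per-file cost.
fixed_bytes = sizes[0][1] - sizes[0][0] * slope_small

print(slope_small, slope_large, fixed_bytes)  # prints: 256.0 256.0 1568.0
```

Both slopes come out the same, so the file size grows linearly with the
number of records, as expected.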

When I run ncdump, a command that accesses all the data in these files, I
don't see any timing anomalies:

    buddy% timex ncdump i1.nc > /dev/null

    real        0.16
    user        0.03
    sys         0.07

    buddy% timex ncdump i10.nc > /dev/null

    real        0.32
    user        0.20
    sys         0.09

    buddy% timex ncdump i100.nc > /dev/null

    real        2.52
    user        1.81
    sys         0.12

The times reported by the "timex" command are in seconds.  I ran this on
an unloaded SPARCstation 10 under Solaris 2.3, using the current version of
the netCDF library.  I can't reproduce your timing results, in which larger
files take much longer to access than would be predicted by a linear model
based on the number of records in the file.  Perhaps what you are seeing is
system-dependent or depends on the order in which you access the data;
ncdump just accesses all the data in sequential order.
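
One rough way to check the timings above against a linear model is to look at
the elapsed time per record.  The sketch below uses only the "real" times
from the three ncdump runs; the variable names are illustrative, and timings
this small on a single run are noisy, so this is a plausibility check rather
than a benchmark.

```python
# (number of records, "real" seconds from timex), taken from the runs above
timings = [(5, 0.16), (50, 0.32), (500, 2.52)]

# Elapsed seconds per record for each file.
per_record = [real / n for n, real in timings]

# Worse-than-linear growth would show up as a per-record cost that rises
# as the files get larger.
superlinear = any(
    later > earlier for earlier, later in zip(per_record, per_record[1:])
)

for (n, _), cost in zip(timings, per_record):
    print(f"{n:4d} records: {cost * 1000:.2f} ms per record")
print("per-record cost rises with file size:", superlinear)
```

Here the per-record cost falls as the files grow, which is what a fixed
startup cost plus a roughly constant cost per record would give.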

It looks like I'll need more information (what machine and operating
system you used) or a program that accesses the data in an order closer to
what you use in your timings to reproduce the problem here.  I could also
send you the three netCDF files I created so you could try ncdump on them,
to see if the problem is system-dependent.
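
To test whether access order matters, something along the lines of the
following Python sketch could time the same set of record reads in
sequential and in shuffled order.  It is only an outline under stated
assumptions: "read_record" is a hypothetical callable standing in for
whatever code reads record i from your file, and "time_access" and
"compare_orders" are names made up for this example, not part of the netCDF
library.

```python
import random
import time


def time_access(read_record, order):
    """Return elapsed seconds for calling read_record(i) for each i in order."""
    start = time.perf_counter()
    for i in order:
        read_record(i)
    return time.perf_counter() - start


def compare_orders(read_record, num_records, seed=0):
    """Time the same reads in sequential order and in a shuffled order."""
    sequential = list(range(num_records))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable runs
    return {
        "sequential": time_access(read_record, sequential),
        "shuffled": time_access(read_record, shuffled),
    }


# Stand-in data source: an in-memory list in place of a real file reader.
records = [[float(i)] * 64 for i in range(500)]
result = compare_orders(lambda i: records[i], len(records))
print(sorted(result))  # prints: ['sequential', 'shuffled']
```

With a real reader substituted for the lambda, a large gap between the two
timings would point to access order rather than file size as the cause.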

__________________________________________________________________________
                      
Russ Rew                                              UCAR Unidata Program
address@hidden                                        P.O. Box 3000
(303)497-8645                                 Boulder, Colorado 80307-3000