
Re: many little files versus one big file



> Keywords: 199404152043.AA21089

Charles,

> > > I have about 1700 files totaling about 2 Mbytes of data.  If I load
> > > and process them individually, it takes about 7 seconds.  If I
> > > concatenate them together on the unlimited dimension to form one big
> > > file, it takes about 35 seconds to load and process them.  Is this
> > > expected?  Wouldn't there be less overhead with just one file?  I got
> > > the exact same results in both cases so I don't think I did anything
> > > terribly wrong.
> > 
> > I'm surprised by the times you are seeing, and would expect that
> > accessing the data as one file would require slightly less time than
> > using lots of little files.  If you can construct a small test case or
> > script that demonstrates this, I could try to find out the reason for
> > the counterintuitive timings.
> > 
> > Are the sums of the sizes of the small files similar to the size of the
> > single large file?  I can imagine that since the record dimension
> > requires padding each record out to an even 32-bit boundary, if you were
> > storing only one byte in each record, the record file would require 4
> > times as much storage and more time to access.
> > 
> > Another possibility is that you are seeing an artifact of the
> > pre-filling of each record with fill values when the first data is
> > written in each record.  This can be avoided by using the ncsetfill()
> > interface (or NCSFIL for Fortran) to specify that records not be
> > pre-filled.  See the User's Guide section on "Set Fill Mode for Writes"
> > to find out more about this optimization and when it is appropriate.
> 
> Russ,
> 
> I might get a chance to construct a test case at some point, but it's
> not easy to isolate.  Note that I'm only reading the files.  Here are
> some more details:
> 
>   cnt = 1737            % this many individual files
>   sum = 4882464         % total number of bytes (more than the single big file)
>   ave = 2810.86
>   min = 1824
>   max = 3872

Since the files are so small, could you send me the output of "ncdump" on
two or three of them?  That might be enough information for me to construct
a small test case.
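
In the meantime, the padding effect I mentioned earlier can be illustrated
with a few lines of Python.  This is only a sketch of the rounding rule as I
described it (each record padded out to a 32-bit boundary), not code from the
netCDF library, and the function name padded_record_size is made up for the
example.

```python
def padded_record_size(nbytes: int, boundary: int = 4) -> int:
    """Round nbytes up to the next multiple of boundary (4 bytes = 32 bits)."""
    # Ceiling division, then scale back up to a byte count.
    return -(-nbytes // boundary) * boundary


# Show how much space a record occupies for a few per-record data sizes.
for nbytes in (1, 2, 4, 5, 8):
    padded = padded_record_size(nbytes)
    print(f"{nbytes} data bytes -> {padded} bytes per record "
          f"({padded / nbytes:.2f}x)")
```

With one byte of data per record this gives the factor of 4 mentioned above.
For records that are already a multiple of 4 bytes there is no overhead, so
comparing the size of the big file with the sum of the small ones should show
whether padding matters in your case.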

__________________________________________________________________________
                      
Russ Rew                                              UCAR Unidata Program
address@hidden                                        P.O. Box 3000
(303)497-8645                                 Boulder, Colorado 80307-3000