Re: many little files versus one big file



> Keywords: 199404152043.AA21089

Hi Charles,

> I have about 1700 files totaling about 2 Mbytes of data.  If I load
> and process them individually, it takes about 7 seconds.  If I
> concatenate them together on the unlimited dimension to form one big
> file, it takes about 35 seconds to load and process them.  Is this
> expected?  Wouldn't there be less overhead with just one file?  I got
> the exact same results in both cases, so I don't think I did anything
> terribly wrong.

I'm surprised by the times you are seeing, and would expect that accessing
the data as one file would require slightly less time than using lots of
little files.  If you can construct a small test case or script that
demonstrates this, I could try to find out the reason for the
counterintuitive timings.

Is the sum of the sizes of the small files similar to the size of the
single large file?  Since each record along the unlimited dimension is
padded out to a 4-byte (32-bit) boundary, a file that stores only one
byte in each record would require four times as much storage as the data
itself, and more time to access.

Another possibility is that you are seeing an artifact of pre-filling:
when the first data is written to a record, the rest of that record is
first filled with fill values.  This can be avoided by using the
ncsetfill() interface (or NCSFIL for Fortran) to specify that records
not be pre-filled.  See the User's Guide section on "Set Fill Mode for
Writes" to find out more about this optimization and when it is
appropriate.
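
For example, a sketch along these lines turns off pre-filling before any
data is written (again, the file and variable names are made up):

    #include "netcdf.h"    /* netCDF version 2 C interface */

    int
    main()
    {
        int   ncid, timedim, dims[1], varid;
        long  start[1], count[1];
        float data[100];
        int   i;

        for (i = 0; i < 100; i++)
            data[i] = (float) i;

        /* hypothetical file with one float per record */
        ncid = nccreate("nofill.nc", NC_CLOBBER);
        timedim = ncdimdef(ncid, "time", NC_UNLIMITED);
        dims[0] = timedim;
        varid = ncvardef(ncid, "x", NC_FLOAT, 1, dims);

        /* don't pre-fill records with fill values; safe only
         * when the program will write every value explicitly */
        ncsetfill(ncid, NC_NOFILL);
        ncendef(ncid);

        start[0] = 0;
        count[0] = 100;
        ncvarput(ncid, varid, start, count, (void *) data);

        ncclose(ncid);
        return 0;
    }

With no-fill set, a value that is never written contains whatever
happened to be in that part of the file, so this is only appropriate
when the program writes every value.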

__________________________________________________________________________
                      
Russ Rew                                              UCAR Unidata Program
address@hidden                                        P.O. Box 3000
(303)497-8645                                 Boulder, Colorado 80307-3000