
Re: Efficient writing to a file



>To: address@hidden
>From: "Kate Edwards" <kateedwards11@xxxxxxxxxxx>
>Subject: netCDF C - What writes faster netcdf files: netcdf for perl,
>c, or fortran?
>Organization: Univ. of Washington, Applied Physics Lab
>Keywords: 200409011954.i81JsPqs012300 netCDF write

Kate,

> Thanks for your quick response.  You're right - I am not writing the file 
> efficiently.  If you can help me figure out how to fix this, I would be very 
> grateful.  Here's the situation.  I have 75 netcdf files.  Each contains 
> variables which have the following dimensions: xpos = 43, ypos= 44, zpos = 
> 15, and time = 48.  My goal is to concatenate these variables in time so 
> that I can analyze them in Matlab.  Which of the following is the way to do 
> this?
> 
> 1)  Open each file in Matlab and concatenate their data.  Very slow! or,
> 2) Concatenate the 75 files.  Save the concatenated data in 1 or more big 
> netcdf files of size 1-6 GB.  Open big files in Matlab to analyze the data.
> 
> If #2 is best, then what is the most efficient way to write the big files?  
> Currently, I first define variables such as salt(time, depth, xpos, ypos) to 
> be size 3600x15x44x43 in the big file.  Then I open the 75 small files 
> sequentially and put their 48 time records into the big file.  For example, 
> I would add records from the 2nd small file to the big file as
> nc_big{'salt'}(48:96, :, :, :) = nc_small{'salt'}(:, :, :, :)  ;
> 
> This is incredibly slow so I am doing something wrong.  If you could point 
> out what it is, I would be very grateful.

The best way to do this may depend on a few things I don't know from
your question, such as the amount of memory you have, the type of the
variables, whether you are using the unlimited dimension for time, and
how many variables there are.

If you are using the unlimited dimension for time, you can use a
freely available program that's part of the NCO (netCDF operators)
package, namely ncrcat.  And even if you aren't using the unlimited
dimension, you might be able to use the related concatenation program
ncecat.  These are documented here:

  http://nco.sourceforge.net/nco.html#Concatenation

and will probably create the desired output file faster than doing it
in Matlab.
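
Assuming the small files use the unlimited dimension for time and have
names that sort into time order (the file names below are hypothetical),
the ncrcat invocation is a one-liner:

```shell
# Concatenate all 75 small files along the record (time) dimension.
# Requires the NCO package on your PATH; file names here are hypothetical.
ncrcat -O small_*.nc big.nc
```

ncrcat treats the last argument as the output file and concatenates the
inputs in the order given; -O overwrites an existing output file.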

If there's some reason the NCO programs are not appropriate, here's
the approach I would recommend.  I'll assume you have enough memory to
hold all the values of one variable from one of the small files in a
single array.  Assuming the structure of the 75 small files is
something like:

  dimensions:
    time=48;
    depth=15;
    ypos=44;
    xpos=43;
  variables:
    float salt(time, depth, ypos, xpos);
    float var2(time, depth, ypos, xpos);
    float var3(time, depth, ypos, xpos);
     ...

then I would do the following; I hope this pseudocode is clear:

  declare a memory array that can hold all values of one variable
  open the output file
  for each input file {
     open input file
         for each variable in input file {
             read variable into memory array with a single netcdf call
             write variable into output variable with a single netcdf call
         }
     close input file
  }
  close output file
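
To make the loop concrete, here is a minimal Python sketch of the same
pattern.  The read function and output array are pure-Python stand-ins
for the single netCDF calls (nc_get_var_float / nc_put_vara_float in
C); the sizes come from your message, but the number of files is
reduced to keep the example small:

```python
# Dimensions from the message: each small file holds 48 times on a
# 15 x 44 x 43 grid; values are stored flattened here for simplicity.
NT, NZ, NY, NX = 48, 15, 44, 43
NFILES = 3                      # 75 in the real case; 3 keeps the demo small
vals_per_file = NT * NZ * NY * NX

def read_variable(filenum):
    """Stand-in for reading one whole variable with a single netCDF call."""
    # A real program would call nc_get_var_float() in C, or the
    # equivalent single-call read in your language of choice.
    return [float(filenum)] * vals_per_file

# Preallocate the output array, standing in for the big file's variable.
big = [0.0] * (NFILES * vals_per_file)

for f in range(NFILES):
    data = read_variable(f)     # one contiguous read per file
    # One contiguous write per file, standing in for nc_put_vara_float()
    # with a start of (f * NT, 0, 0, 0) and a count of (NT, NZ, NY, NX).
    big[f * vals_per_file:(f + 1) * vals_per_file] = data
```

The key point is the single contiguous read and single contiguous
write per variable per file; with several variables, you would wrap
the loop body in an inner loop over the variable names.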

If memory is tight, you could instead make multiple reads and writes
per variable, for example with an inner time loop that reads all the
data for one time step of a variable and writes it to the appropriate
time slab of the output.  This would take about the same amount of
time.

The above is about as efficient as this can be done: it opens each
input file only once, reads each variable once contiguously, and
writes contiguous data values to the output.  This is probably what
the ncrcat program does.

I would think you could do this in Matlab just about as efficiently as
in C, C++, Fortran-77, Fortran-90, Java, Python, or Ruby, since the
task will be I/O bound.

If you have many variables, the output file may exceed 2 GiB
(2147483648 bytes), in which case you will have to use the unlimited
(record) dimension for time.  You will also have to make sure your
file system supports large files (> 2 GiB) and that you don't have any
user-specific file-size limits smaller than the required output size.
We will shortly release netCDF 3.6.0, which has improved support for
large files, so it will no longer be necessary to use the unlimited
(record) dimension in the output file; a beta test version is
available now if you need to test that feature.  But version 3.5
should work fine if you need to exceed 2 GiB and can use the record
dimension for time.
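
For scale, a back-of-the-envelope check using the dimensions from your
message: each float variable in the concatenated file holds
3600 x 15 x 44 x 43 four-byte values, so about six such variables are
enough to cross 2 GiB:

```python
# 75 files x 48 times = 3600 time steps; the grid is 15 x 44 x 43;
# a netCDF float is 4 bytes.
bytes_per_var = 3600 * 15 * 44 * 43 * 4   # 408,672,000 bytes, about 390 MiB
two_gib = 2 ** 31                         # 2,147,483,648 bytes
print(two_gib / bytes_per_var)            # about 5.25 variables fit in 2 GiB
```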

--Russ