
[netCDF #PKT-462504]: Re: [netcdfgroup] netcdf-4 open/close memory leak



> Howdy, Ed et al.
> 
> I got snapshot2010011908 and reran some tests, which unfortunately
> don't show much change.  I'll start with the simple open/close memory
> issue that Jeff Whitaker brought up.  The code was short enough that I
> could simply add the get_mem_used calls:
> 


Howdy Ted!

I have added this test to libsrc4/tst_files2.c, which is only built with 
--enable-benchmarks, or "make tst_files2". (But you must wait for tomorrow 
morning's snapshot to see this.)
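The loop in that test is roughly like the following. (This is only a sketch, not the actual tst_files2.c code; the file name and iteration count here are made up.)

#include <netcdf.h>

#define FILE_NAME "tst_files2_1.nc"   /* hypothetical file name */
#define NUM_TRIES 6                   /* hypothetical iteration count */

int
main()
{
   int ncid, i;

   for (i = 0; i < NUM_TRIES; i++)
   {
      if (nc_open(FILE_NAME, NC_NOWRITE, &ncid)) return 1;
      if (nc_close(ncid)) return 1;
      /* Memory use would be sampled here on each pass (for example
       * with a get_mem_used helper) to watch for growth. */
   }
   return 0;
}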

But I get different results from you.

Furthermore, I have come up with three different get_mem_used functions. All 
should be correct, as far as I can tell, but they all give different answers. Sigh.
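For reference, here is a sketch of one way such a function can be written on Linux, reading the resident set size from /proc/self/statm. (This is just one possible approach - getrusage or /proc/self/status are others - which is part of why different variants can disagree.)

#include <stdio.h>
#include <unistd.h>

/* Return resident memory in kB, or -1 on error. */
static long
get_mem_used(void)
{
   long size = 0, resident = 0;
   FILE *f = fopen("/proc/self/statm", "r");

   if (!f)
      return -1;
   if (fscanf(f, "%ld %ld", &size, &resident) != 2)
      resident = -1;
   fclose(f);
   if (resident < 0)
      return -1;
   /* statm reports pages; convert to kB. */
   return resident * (sysconf(_SC_PAGESIZE) / 1024);
}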

> [I mangled the name of get_mem_used_ to use it with fortran.]
> 
> With a netcdf3 file, the output is (last column is net change per
> iteration)
> 
> start: memuse= 440    440     440
> memuse,delta= 452     1568    1576    1116    1124
> memuse,delta= 1576    1584    1576    8       0
> memuse,delta= 1576    1584    1576    8       0
> memuse,delta= 1576    1584    1576    8       0
> memuse,delta= 1576    1584    1576    8       0
> memuse,delta= 1576    1584    1576    8       0
> 

Isn't netCDF classic so nice and well-behaved? ;-)

What input file are you using? One created by tst_files2.c?

> 
> But with a netcdf4 file:
> 
> start: memuse= 440    440     440
> memuse,delta= 452     2804    2316    2352    1864
> memuse,delta= 2316    2852    2320    536     4
> memuse,delta= 2320    2856    2324    536     4
> memuse,delta= 2324    2860    2328    536     4
> memuse,delta= 2328    2864    2332    536     4
> memuse,delta= 2332    2868    2336    536     4
> memuse,delta= 2336    2868    2336    532     0
> memuse,delta= 2336    2872    2340    536     4
> memuse,delta= 2340    2876    2344    536     4
> 
> Oddly, there is an occasional 0 increase for netcdf4, and I can't tell
> if there is a pattern to it.  But in general there is a net 4 kB
> increase in memory use for just opening and closing a file, so I guess
> something is not being freed.  I assume the initial jump from 440 to
> 1568 (or 2804) has to do with opening the first file.

But I *know* everything is being freed, because valgrind would tell me otherwise. 
So perhaps what is happening here is that HDF5 is allocating memory and not 
freeing it until the library exits. That would produce the behavior we are 
seeing. But I don't think they would be crazy enough to do this. I have sent 
them an email asking about it.

I know what they will ask in turn: that you upgrade to their latest snapshot 
too, and also make sure that you build HDF5 with --enable-using-memchecker. 

> 
> Now, on to the other memory problem I have:
> 
> I managed to hack the get_mem_used into my Fortran90 code to combine
> data from multiple files, and it seems to shed light on what is going
> on.   I have two methods reading the data: 1) Use netcdf4 to read each
> file variable at a given time and 2) use the HDF5 interface to read
> the data from each file.  Both cases are using the fortran interfaces
> and both use the netCDF4 interface to write out the combined data.
> 
> The individual files have chunk sizes for the 3D variables that are
> equal to the spatial dimensions x,y,z (40x60x80) with time being the
> unlimited dimension.

Did you set these chunk sizes, or are they the default? And what chunk size are 
you using for the unlimited dimension? (The default is 1.)
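If you are not sure, you can ask the library what chunk sizes a variable actually has with nc_inq_var_chunking. A quick sketch (the file and variable names are placeholders):

#include <netcdf.h>
#include <stdio.h>

int
main()
{
   int ncid, varid, storage, ndims, d;
   size_t chunksizes[NC_MAX_VAR_DIMS];

   if (nc_open("data.nc", NC_NOWRITE, &ncid)) return 1;
   if (nc_inq_varid(ncid, "var", &varid)) return 1;
   if (nc_inq_varndims(ncid, varid, &ndims)) return 1;
   if (nc_inq_var_chunking(ncid, varid, &storage, chunksizes)) return 1;
   if (storage == NC_CHUNKED)
      for (d = 0; d < ndims; d++)
         printf("dim %d: chunksize %ld\n", d, (long)chunksizes[d]);
   else
      printf("contiguous storage\n");
   if (nc_close(ncid)) return 1;
   return 0;
}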

> 
> So, when reading a variable from 64 files, the total chunk cache would
> be 40x60x80 gridpoints x 4 bytes per point x 64 files = 48000 kB = 47
> MB, and this is almost exactly the average increase in memory that I
> see when using netcdf4 to read the files.  (There is an additional 8
> MB per variable added by the NF90_DEF_VAR process.)  If all the input
> files are closed and reopened after reading a variable, about half of
> the memory (27 MB or so) gets freed up.  The closing/opening process
> takes a lot of time, too, much more than the time to read a variable.

Wait, how do you get that this is the size of the total chunk cache? 

Make sure you are using a recent snapshot of netCDF (and you need to get the 
very latest - I just put one out this afternoon); I think the automatic cache 
sizing is working well now.

Here's what happens with the cache size: whatever you set for the file-level 
cache (with nc_set_chunk_cache) will be used by default for each variable in 
the file. That is, if the file-level cache is set to 100 MB and you open a file 
with 10 variables, the file could consume up to 100 MB * 10, or 1 GB, of memory.

The default file-level cache size in the current netCDF is now 4 MB. HDF5 uses 
1 MB by default, but netCDF overrides that.
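For example, the file-level cache can be set before the file is opened, something like this (the size, nelems, and preemption values here are only illustrative, not recommendations):

#include <netcdf.h>

int
open_with_cache(const char *path, int *ncidp)
{
   int ret;

   /* Set the file-level chunk cache: size in bytes, number of chunk
    * slots, and preemption policy (0.0 - 1.0).  These illustrative
    * values give a 4 MB cache. */
   if ((ret = nc_set_chunk_cache(4194304, 1009, 0.75)))
      return ret;

   /* Files opened after this call get this cache size per variable. */
   return nc_open(path, NC_NOWRITE, ncidp);
}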

Now, when a file is opened, the cache size is adjusted for each variable: the 
chunk cache will be sized to hold 10 chunks of each variable, up to a 64 MB 
maximum per variable. The per-variable chunk cache can also be modified with 
nc_set_var_chunk_cache. (And Russ has wondered whether calling that with a 
setting of zero would change your results; a sketch of that experiment follows.)
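If you want to try Russ's idea, something along these lines should do it once the file is open (a sketch only; it shrinks the per-variable cache to essentially nothing just for the experiment):

#include <netcdf.h>

int
shrink_var_caches(int ncid)
{
   int nvars, varid, ret;

   if ((ret = nc_inq_nvars(ncid, &nvars)))
      return ret;

   /* Set each variable's chunk cache to (essentially) zero. */
   for (varid = 0; varid < nvars; varid++)
      if ((ret = nc_set_var_chunk_cache(ncid, varid, 0, 0, 0.75)))
         return ret;

   return NC_NOERR;
}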

So the cache size is a little bit of a complicated topic.

> 
> For what it's worth, the chunk size for 3D variables in the combined
> file is reported by ncdump as 32, 173, 346 for 7482 kB (4-byte
> reals).  The spatial dimensions are 60 x 319 x 640 (z,y,x).
> 
> When I use direct HDF5 calls to read the data, on the other hand,
> there is hardly any (16kB) increase in memory usage (I open one file
> at a time, read the current variable, and close the dataset and file
> before going to the next file).  Also, the total time is the same as
> for using netcdf4, so there seems to be no performance penalty for
> deallocating the dataset memory.

But I *know* that when you close a netCDF-4 file, all HDF5 objects in the file 
get closed. Let me run another test tomorrow to demonstrate that...
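One way to see this from the HDF5 side, since you are already calling HDF5 directly, is to ask the library how many objects remain open after nc_close. A sketch:

#include <hdf5.h>
#include <stdio.h>

/* Report how many HDF5 objects (files, datasets, groups, etc.) are
 * still open, across all open files.  If everything has been closed
 * cleanly, this should print zero. */
void
report_open_hdf5_objects(void)
{
   ssize_t n = H5Fget_obj_count((hid_t)H5F_OBJ_ALL, H5F_OBJ_ALL);
   printf("open HDF5 objects: %ld\n", (long)n);
}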

> 
> Now, I can understand leaving a read cache around if it might be used
> again, but here's the rub: when I read a second time level, the memory
> keeps going up by 47 MB for each variable, even though that variable
> was already read before.  So it seems that the previously-allocated
> cache is not getting reused, but is being allocated again?
> 
> My suggestion is to try assuming that a read cache will _not_ be used
> again, and just deallocate it as soon as the read task is finished.
> (Perhaps this would be in nc4hdf.c?)  Based on my HDF5 read tests,
> there's no performance penalty from having to allocate the space
> again, so there seems to be little reason to keep it hanging around.
> Particularly since it can add up to a lot of space when reading from
> lots of files.  I'd be willing to test it, too.
> 
> Best,

Thanks, I will hit this again tomorrow morning.

Ticket Details
===================
Ticket ID: PKT-462504
Department: Support netCDF
Priority: Critical
Status: Open