
[netCDF #PEB-847323]: Re: [netcdf-hdf] NF90_GET_VAR memory leak?



> Howdy!
> 
> Thanks for looking into this.
> 
> I have tried to use parallel/IO with limited success.  The main
> limitation is that it doesn't allow for compression (HDF5 issue, I
> know).  The computing cluster where I make my big runs has a parallel
> file system (panasas or something like that), but my job just hangs
> when I try to use netcdf4 parallel IO.  The system is a seemingly
> standard linux cluster with x86 processors, infiniband, and an LSF
> batch (details at
> <http://www.oscer.ou.edu/hardsoft_dell_cluster_harpertown_sooner.php>).
> Parallel/IO works fine on my desktop machine (Mac OS X),
> however, so I think my code is OK.  On the mac I usually use 8 or
> fewer threads, in which case a round-robin (token ring) write works
> better because I get the compression.

As you point out, the compression issue is not something we can do anything 
about. With compression, a process has no way to predict where in the file to 
write its data, because the compressed data of the processes that write earlier 
in the file has an unknown length until it has actually been compressed.
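
To make the offset problem concrete, here is a toy sketch in Python. It is not how HDF5 actually lays out chunks; it only assumes four hypothetical processes that each hold a block of the same uncompressed size and want to write it back to back. The block contents, `BLOCK_BYTES`, and the `offsets` helper are made up for illustration.

```python
import os
import zlib

BLOCK_BYTES = 64 * 1024  # hypothetical per-process block size


def offsets(sizes):
    """Return the start offset of each block when blocks are laid out back to back."""
    starts, pos = [], 0
    for size in sizes:
        starts.append(pos)
        pos += size
    return starts


# Four made-up payloads with identical uncompressed size but different content.
blocks = [
    bytes(BLOCK_BYTES),                                  # all zeros
    bytes(range(256)) * (BLOCK_BYTES // 256),            # repeating ramp
    os.urandom(BLOCK_BYTES),                             # incompressible noise
    (b"netcdf" * (BLOCK_BYTES // 6 + 1))[:BLOCK_BYTES],  # repeating text
]

# Uncompressed: every rank can compute its own offset as rank * BLOCK_BYTES
# without knowing anything about the other ranks' data.
raw_offsets = offsets([len(b) for b in blocks])

# Compressed: each size depends on the data, so a rank's offset is only known
# after every earlier rank has finished compressing its block.
compressed_sizes = [len(zlib.compress(b)) for b in blocks]
compressed_offsets = offsets(compressed_sizes)

if __name__ == "__main__":
    print("raw offsets:       ", raw_offsets)
    print("compressed sizes:  ", compressed_sizes)
    print("compressed offsets:", compressed_offsets)
```

In the uncompressed case each rank's offset is a pure function of its rank, which is what lets the writes proceed independently. In the compressed case the offsets form a chain of dependencies from one rank to the next.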

Getting code to run on supercomputers is always a challenge. Parallel I/O is 
well-tested and does work. Was the netCDF library on the supercomputer built 
with parallel I/O support? (HDF5 must be built with --enable-parallel, and 
mpicc must be used to compile.)
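
One quick way to check the build on the cluster is to ask the `nc-config` utility that is installed alongside netCDF. The sketch below assumes an `nc-config` recent enough to understand the `--has-parallel` option, which older releases may not; the function name `netcdf_has_parallel` is made up for this example. It returns `None` whenever it cannot get a clear yes/no answer.

```python
import shutil
import subprocess


def netcdf_has_parallel(tool="nc-config"):
    """Return True/False as reported by `nc-config --has-parallel`.

    Returns None if the tool is not on PATH, fails to run, or prints
    something other than "yes" or "no" (for example, an older nc-config
    that does not know the option).
    """
    path = shutil.which(tool)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--has-parallel"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    answer = result.stdout.strip().lower()
    if answer == "yes":
        return True
    if answer == "no":
        return False
    return None


if __name__ == "__main__":
    print("netCDF parallel I/O support:", netcdf_has_parallel())
```

A result of `False` or `None` on the cluster would point to a build problem before any time is spent debugging the application code.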

There are parallel tests in the netCDF distribution (like 
nc_test4/tst_parallel3.c) which can be run on your target platform. If they 
hang, there is something wrong with the platform and you can demonstrate it to 
the sysadmins. If they work, you can take a look to see what they are doing 
that your code is not doing.


> 
> I tried a while back to set up an example with fake data to try to
> reproduce the memory growth problem I was seeing, but without real
> success.  I thought then that maybe there was a bug in my program, but
> switching the reading from netcdf4 to pure hdf5 seemed to solve the
> problem.  So I think it really is something with netcdf4's routines.
> 
> I'm getting an ftp directory set up so you can get an example file.
> I'll let you know when that happens.
> 

Just do an ncdump -h on one of your data files and send it to me, and I will 
take it from there.

I am getting ready to release 4.1 very soon, so if we want to get any fixes in, 
it would be best if you could send me the ncdump output right away.

Thanks,

Ed

Ticket Details
===================
Ticket ID: PEB-847323
Department: Support netCDF
Priority: Critical
Status: Closed