Hi Jesse,

> It's netcdf 3.6.3. The source for the program writing is attached. All
> it's doing is taking a few netcdf files (48 in this case) and combining
> them.
>
> If we change the operation so that the written file is outside of the
> lustre file system, such as symbolically linking the output file to
> /dev/shm or even a local file system, the write occurs in less than 20
> seconds. If the output file is on the lustre file system, it takes
> about 5 minutes.
>
> We do have parallel netcdf installed, but the scientists would have to
> link their models against it of course. If the performance of parallel
> netcdf was sufficiently high it would be an easy sell.

It looks like this is exactly the problem first identified in the NCO
software, and later verified with nccopy, that's fixed in the latest
releases of nccopy and NCO. I'm assuming all the input files have an
unlimited ("record") dimension, and that the merged output file does
also.

The problem is that the program writes output variables a variable at a
time, when it should instead be writing output records a record at a
time, with all the variable data written for each record before
advancing to the next record. The first strategy is slow when you have
a lot of record variables and a large disk block size (such as on
Lustre file systems). That's because a disk block is typically larger
than a record's worth of data for one variable, so writing a variable
at a time ends up rewriting the same disk block multiple times, once
for each variable whose data is included in that disk block.

The problem and its solution are explained in more detail here,
starting with the seventh posting in the forum:

http://sourceforge.net/projects/nco/forums/forum/9829/topic/4898620/index/page/1

You have several options, depending on whether you need an unlimited
dimension in the output (merged) file:

1. If you don't need an unlimited dimension in the output, define the
   dimension corresponding to the unlimited dimensions in the input to
   be of fixed size (probably just the sum of the unlimited dimension
   sizes in the input). Then writing a variable at a time will be fast.

2. If you still need an unlimited dimension in the output, change the
   order of the nested loops and the start and count vectors so that
   the record dimension is the outside loop. For each record, read a
   record's worth of data from all the input files that include data
   for that record and all associated record variables. This will also
   greatly speed up the program on Lustre file systems, or any file
   system with a large disk block size.

3. Consider using a package such as NCO (or NCL or CDO) to do the data
   concatenation for you. This problem is well solved in the NCO
   concatenation operators, may be solved in NCL, and I don't know
   about CDO.

I think rewriting your current WRF processing program to reorder the
loops and data writing wouldn't take too long, but you might be able to
adapt an NCO solution such as ncrcat in even less time:

http://nco.sourceforge.net/nco.html#ncrcat-netCDF-Record-Concatenator

--Russ

>
> On 05/11/2012 10:38 AM, Unidata netCDF Support wrote:
> > Hi Jesse,
> >
> >> I have an HPC cluster using lustre as our backend file systems. The
> >> cluster serves primarily weather models, such as the WRF and GFS.
> >>
> >> One thing we observed is that netcdf writes can often be very slow on
> >> lustre. Do you have any recommended tuning procedures for netcdf on
> >> lustre?
> >
> > No, sorry, we don't currently test on lustre.
> > However, if you have configured lustre with a large disk block size
> > and are writing netCDF files with lots of records and lots of record
> > variables (variables that use an unlimited dimension), then you
> > could be seeing a problem with writing such data a variable at a
> > time instead of a record at a time:
> >
> > https://www.unidata.ucar.edu/jira/browse/NCF-142
> >
> > You haven't said what version of the library you're using, but the
> > fix above is in the nccopy utility in version 4.2, and in some of
> > the utilities in the most recent release of NCO (the NetCDF
> > Operators software from UC Irvine).
> >
> > Also, are you using parallel I/O? Use of parallel-netcdf may be a
> > solution worth looking at if you're writing classic-format files, or
> > the HDF5-based parallel I/O in netCDF-4 otherwise.
> >
> > If you have a small example that demonstrates the bad performance,
> > we could try to reproduce it and diagnose the problem.
> >
> > --Russ
> >
> > Russ Rew                                    UCAR Unidata Program
> > address@hidden                              http://www.unidata.ucar.edu
> >
> > Ticket Details
> > ===================
> > Ticket ID: TCU-710461
> > Department: Support netCDF
> > Priority: Normal
> > Status: Closed
>

Russ Rew                                    UCAR Unidata Program
address@hidden                              http://www.unidata.ucar.edu

Ticket Details
===================
Ticket ID: TCU-710461
Department: Support netCDF
Priority: Normal
Status: Closed
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.