[netCDF #JZW-805384]: Poor I/O performance on large blocksize filesystems



Gary,

I've added a couple of hundred lines of C code to nccopy to implement
this new rule of thumb:

  When accessing data from netCDF classic or 64-bit offset format
  files that have multiple record variables and a lot of records on a
  file system with large disk block size relative to a record's worth
  of data for one or more record variables, access the data a record
  at a time instead of a variable at a time.

The new version of nccopy will be in the upcoming 4.2 release.

The improvement this makes in a typical case, using nccopy to
convert or copy a 1.13 GB sample file that has lots of record
variables on a system with a 512 KB disk block size, is significant: a
factor of 24 in elapsed time.  I just tried it on the same sample file
on the /glade/scratch file system on one of NCAR CISL's platforms,
and there the result was a factor of 60 speedup.

Unfortunately, this is not a problem that can be fixed in the
library; it's a matter of applying the above rule of thumb in
applications that use the library, which requires special-case
code.  However, it turns out that I didn't have to bother with testing
the disk block size at all in the new nccopy code.  The special-case
code works fine and is no slower even when the disk block size is
small, like 4096 bytes.  It just speeds things up more for a larger
disk block size.
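
To see why the access order matters, here is a minimal sketch in Python. It is not the nccopy code and does not use the netCDF library. It models the record section of a classic-format file as slabs interleaved by record, and counts disk block fetches for both access orders. The one-block cache, the function names, and the example sizes (borrowed from the 2 MiB example quoted below) are assumptions made for illustration; the file header and any fixed-size variables are ignored.

```python
def blocks_fetched(order, nrecs, nvars, slab, block):
    """Count block fetches using a one-block cache (a deliberately crude model)."""
    cached = -1      # index of the single block currently held in memory
    fetched = 0
    for rec, var in order(nrecs, nvars):
        # In the classic format, record `rec` holds one slab per record variable.
        start = (rec * nvars + var) * slab
        first = start // block
        last = (start + slab - 1) // block
        for b in range(first, last + 1):
            if b != cached:
                fetched += 1
                cached = b
    return fetched


def by_variable(nrecs, nvars):
    """All records of variable 0, then all records of variable 1, ..."""
    return ((r, v) for v in range(nvars) for r in range(nrecs))


def by_record(nrecs, nvars):
    """All variables of record 0, then all variables of record 1, ..."""
    return ((r, v) for r in range(nrecs) for v in range(nvars))


NRECS, NVARS = 365, 100
SLAB = 73 * 145 * 4            # bytes per record for one float variable
BLOCK = 2 * 1024 * 1024        # 2 MiB disk block

var_blocks = blocks_fetched(by_variable, NRECS, NVARS, SLAB, BLOCK)
rec_blocks = blocks_fetched(by_record, NRECS, NVARS, SLAB, BLOCK)
print("variable at a time:", var_blocks, "block fetches")
print("record at a time:  ", rec_blocks, "block fetches")
print("ratio: %.1f" % (var_blocks / rec_blocks))
```

In this toy model the record-at-a-time order touches each block once, while the variable-at-a-time order fetches at least one block per slab, roughly 50 times as many. Measured elapsed-time ratios, such as the factors of 24 and 60 above, also depend on caching and on the file system itself.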

Actually we may eventually be able to speed things up in the netCDF
library as well for classic-format files by replacing all the
(over-)optimized code that currently uses read(2) and write(2) system
calls with ordinary stdio fread(3) and fwrite(3) calls, but that may
have to wait for version 4.3 ...

--Russ

> The slowdown occurs when all these conditions are met:
> 
> 1.  You're dealing with netCDF classic or 64-bit offset format
> files, not netCDF-4 or netCDF-4 classic model files.
> 
> 2.  You have an unlimited dimension and many record variables that
> use it.
> 
> 3.  The file system has a large block size, the atomic size for
> disk access.
> 
> In this case, doing things a variable at a time instead of a record at
> a time can be very slow, because accessing all the data in a variable
> (or some part of each record for a variable) typically reads each
> record multiple times, once for each record variable you're dealing
> with.  That's because the block size is larger than it needs to be to
> hold a record's worth of data for each variable, so accessing the nth
> record's data for a variable typically reads in more data than is
> needed.
> 
> Consider a case that's not too atypical: a block size of 2 MiBytes,
> 365 records, and 100 float record variables, each dimensioned
> (time=365, lat=73, lon=145) where time is the record dimension.  A
> record's worth of data for each variable is only 73*145*4 = 42340
> bytes, and each variable has 365 records.  So reading one variable of
> size 365*73*145*4 bytes, or about 15.5 Mbytes, actually reads 365 disk
> blocks, which is 365*2097152 bytes or 765 Mbytes.  That's about 50 times more
> bytes read than needed.  If you operate on every variable in the file,
> one at a time, the result is 50 times more I/O than necessary, which
> explains why it might be 50 times slower than it would be if you used
> fixed size variables, stored contiguously, rather than record
> variables, stored in pieces scattered throughout the records of the
> file.
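
The arithmetic in that example can be checked with a few lines of Python. This is only a sketch: it assumes, as the example does, that each record's slab costs exactly one whole block read (a slab that straddles a block boundary would cost two), and the names are made up for illustration.

```python
BLOCK = 2 * 1024 * 1024     # 2 MiB disk block, in bytes
NRECS = 365                 # records along the unlimited (time) dimension
SLAB = 73 * 145 * 4         # one record of one float variable, in bytes


def read_amplification(slab, nrecs, block):
    """Return (bytes_needed, bytes_read, ratio) for reading one whole variable.

    Assumes one full block is read per record and that no block is shared
    between consecutive records of the same variable.
    """
    needed = slab * nrecs
    read = block * nrecs
    return needed, read, read / needed


needed, read, ratio = read_amplification(SLAB, NRECS, BLOCK)
print("bytes per record slab:", SLAB)
print("bytes needed for one variable:", needed)
print("bytes actually read:", read)
print("amplification: %.1f" % ratio)
```

The ratio is simply the block size divided by the slab size, so it does not depend on the number of records.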
> 
> How can you deal with this to get efficient processing of such files?
> Here are some workarounds and solutions:
> 
> 1.  Don't use the unlimited dimension if you don't really need it.
> 
> 2.  Make sure the record size of each variable is close to, but no
> larger than, a multiple of the disk block size.
> 
> 3.  Convert your record-oriented file to a file with only fixed size
> dimensions before using it in processing.  There's an nco operator
> for this, or you can use "nccopy -u infile outfile" to make the
> unlimited dimension a fixed size.
> 
> 4.  Change the processing algorithms to read input a record at a time
> instead of a variable at a time, processing all the record
> variables after each record has been read.
> 
> 5.  Use netCDF-4 classic model files or regular netCDF-4 files.  With
> the netCDF-4/HDF5 format, data is accessed in disk blocks, if
> stored contiguously, or by chunks for chunked data.  A chunk only
> contains data from a single variable.  Making chunks close to but
> less than or equal to a multiple of the disk block size ensures
> that I/O will be fairly efficient.  If data is compressed, each
> chunk is compressed separately, so if compressed chunks are much
> smaller than the disk block size, inefficiencies may still occur.
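
As a rough illustration of the chunk-sizing advice in item 5, the following Python sketch picks how many records to put in one chunk so that the uncompressed chunk is as large as possible without exceeding one disk block. The helper name is hypothetical and the sizes are taken from the 2 MiB example above; it says nothing about how HDF5 actually aligns chunks on disk, and it does not account for compression.

```python
def records_per_chunk(slab_bytes, block_bytes, nblocks=1):
    """Largest number of whole record slabs that fit in `nblocks` disk blocks.

    Returns at least 1, since a chunk must hold at least one record.
    """
    return max(1, (nblocks * block_bytes) // slab_bytes)


NLAT, NLON, FLOAT_BYTES = 73, 145, 4
slab = NLAT * NLON * FLOAT_BYTES      # bytes in one (lat, lon) slab
block = 2 * 1024 * 1024               # 2 MiB disk block

n = records_per_chunk(slab, block)
chunk_shape = (n, NLAT, NLON)         # (time, lat, lon)
fill = n * slab / block               # fraction of the block the chunk uses

print("chunk shape:", chunk_shape)
print("chunk bytes:", n * slab, "of", block)
print("block fill: %.1f%%" % (100 * fill))
```

A chunk shape chosen this way could then be passed to whatever tool writes the netCDF-4 file; the choice of a single block per chunk is an assumption, and a larger multiple may suit other access patterns better.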
> 
> I will be using approach number 4 in nccopy to detect and deal with
> this situation on systems with large block size.
> 
> --Russ
> 
> > Below are the two key posts of a discussion amongst Gary Strand, CISL,
> > and myself about NCO and netCDF performance issues on large blocksize
> > filesystems. Full thread at
> > https://sourceforge.net/projects/nco/forums/forum/9829/topic/4898620
> > Would appreciate any insight from Unidata on the problem.
> > Charlie
> > ************************************************************************
> > From Gary Strand 20111222:
> >
> > (Preface: I've been a very happy user of nco for many years)
> >
> > Technical details:
> > NCO netCDF Operators version "4.0.8" last modified 2011/04/26 built Oct 18 2011 on mirage4 by jam
> > ncks version 4.0.8
> > Linked to netCDF library version 4.1.3, compiled Jul 26 2011 15:05:13
> > Copyright (C) 1995--2011 Charlie Zender
> >
> > Problem: This issue may be related to the NOFILL issue with netCDF 4.1.2; in
> > any case, on filesystems with large blocksizes (2M, for example, 'lustre' and
> > NCAR's GLADE system) the I/O performance of even simple 'ncks' operations is
> > horrible - time-to-completion ratios (compared to smaller blocksize
> > filesystems) of 300:1 or even 1500:1 are not uncommon.
> >
> > Investigation with NCAR CISL staff showed that a simple variable extraction
> > that takes about 20 seconds on a small blocksize filesystem takes about
> > 40 minutes on the GLADE filesystem (120:1 ratio) and that the following was
> > found:
> >
> > 12/20/11 3:57 PM JAM
> > I should add that the actual performance for the first 39 minute test was
> > around 30MB/sec for reads and 12MB/sec for writes.  So nco may be doing
> > something else inefficiently in addition to reading/writing extra data.
> >
> > 12/20/11 3:52 PM JAM
> > Hi, we've done some testing since first getting this ticket and have found
> > that the performance of ncks on filesystems with large block sizes (most of
> > Glade is at a 2MB block size) is VERY bad and it seems to be reading/writing
> > much more data than necessary.
> >
> > The test we used was: "ncks -x -v TH b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc out.nc"
> >
> > - The input file is 3.3GB and the output file is 1.1GB.  On an idle system
> > (storm4) this command took around 39 minutes to complete when either input
> > or output file is on a Glade filesystem.  During this time 60GB was read
> > from glade and 26GB was written.
> >
> > - Adding the -4 option to ncks resulted in the test taking 10 minutes to
> > complete with 30GB read and 1.2GB written.
> >
> > - We also ran the same tests on a Lustre filesystem with a 1MB block size
> > and saw similar bad performance.
> >
> > - Finally, running the same test with both input/output files in /tmp (local
> > drive, 4k block size) finished in 17 seconds.
> >
> > I don't know exactly what ncks does or how it does it, but there seems to be
> > an issue with large-block filesystems possibly causing it to read and write
> > overlapping blocks of data, resulting in the very large numbers of extra
> > bytes read/written listed above.  Large-block filesystems also caused the
> > silent data corruption issue with nco a few months back, which could be
> > related.
> >
> > With this information and your plots, the effect of system load and number
> > of users is not as significant as we originally thought, and the bad
> > performance on glade is most likely related to the actual amount of data
> > being transferred (86GB in the worst case).
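
For reference, the transfer totals reported above work out to the following amplification factors and average rates. This is a small Python sketch; it assumes decimal units (1 GB = 1000 MB) and that the reported totals cover the whole 39 minute run, neither of which the report states explicitly.

```python
def transfer_ratio(transferred_gb, file_gb):
    """How many times the file's own size was moved to or from disk."""
    return transferred_gb / file_gb


def average_rate_mb_s(transferred_gb, minutes):
    """Average transfer rate in MB/s, assuming 1 GB = 1000 MB."""
    return transferred_gb * 1000 / (minutes * 60)


INPUT_GB, OUTPUT_GB = 3.3, 1.1

# Default output format: 60 GB read, 26 GB written, about 39 minutes.
read_ratio = transfer_ratio(60, INPUT_GB)
write_ratio = transfer_ratio(26, OUTPUT_GB)
read_rate = average_rate_mb_s(60, 39)
write_rate = average_rate_mb_s(26, 39)

# With the -4 option: 30 GB read, 1.2 GB written.
read_ratio_nc4 = transfer_ratio(30, INPUT_GB)
write_ratio_nc4 = transfer_ratio(1.2, OUTPUT_GB)

print("default: read x%.1f, write x%.1f" % (read_ratio, write_ratio))
print("default: %.1f MB/s read, %.1f MB/s written" % (read_rate, write_rate))
print("with -4: read x%.1f, write x%.1f" % (read_ratio_nc4, write_ratio_nc4))
```

The average rates come out somewhat below the roughly 30MB/sec and 12MB/sec quoted above, which is plausible given the rounding in the reported figures.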
> >
> > Is this an NCO problem or a netCDF-4 problem?
> > ************************************************************************
> > From Charlie Zender 20120103:
> >
> > Hi Gary,
> >
> > I have reproduced the problem you are experiencing using NCO on the
> > large block filesystem (LBF) named GLADE. The binaries in
> > ~zender/bin/[LINUXAMD64,AIX] improve the performance by about a factor
> > of two relative to NCO 4.0.8, but something lower in the software
> > stack than NCO, e.g., the netCDF library or the filesystem itself
> > seems to cause the gross degradation in performance relative to
> > NCO on smaller block filesystems.
> >
> > Without going into too much detail, and for the benefit and comment of
> > others following this issue, my conclusions about the slow performance
> > of NCO on LBFs (i.e., GLADE) on both AIX and Linux are:
> >
> > 0. NCO (and ncks in particular) doesn't use any fancy algorithms.
> > NCO uses only official, documented netCDF API calls to do its work.
> > NCO does not pay attention to block-sizes. Unless hyperslabbing is
> > requested, NCO transfers entire variables with _one call_ (rather than
> > with continuous/consecutive calls) to nc_[get/put]_var.
> > 1. Slow performance on LBFs is experienced when any version of NCO is
> > linked to any netCDF version including 4.1.3. I tested this with NCO
> > 3.9.6 on AIX (using the bluefire default, i.e., /usr/local/bin/ncks
> > which is linked to netCDF 3.6.2), and Gary or CISL tested this with
> > NCO 4.0.8 on Linux (unsure what library they used).
> > 2. NCO version 4.0.8 worsens the performance relative to other
> > versions of NCO, but changes in 4.0.8 do not cause the underlying
> > problem. 4.0.8 uses netCDF fill-mode to work around the netCDF 4.1.2
> > (and all preceding versions) "NOFILL" bug. This causes 4.0.8 to write
> > (at least) twice as much data as other versions of NCO.
> > 3. NCO version 4.0.9, which is in beta and not yet released, improves
> > the performance by about a factor of two relative to 4.0.8. This is
> > consistent with the reversion of 4.0.9 to previous NCO behavior which
> > utilizes the netCDF NOFILL feature to reduce writes by (at least) a
> > factor of two. It is only safe to use NCO 4.0.9+ with netCDF
> > 4.1.3+. Otherwise the netCDF NOFILL bug may be triggered.
> > 4. NCO operations on LBFs are twice as fast on Linux as on AIX.
> > Extracting large datasets to netCDF3 files rather than netCDF4 files
> > takes ~2.5 times as long. These factors are independent, so the best
> > performance on large block filesystems is obtained with NCO 4.0.9 (or
> > any NCO except 4.0.8) under Linux writing netCDF4 files. The worst
> > performance will be with NCO 4.0.8 under AIX writing netCDF3 files.
> > 5. Improving NCO performance on LBFs may require more detailed
> > performance analysis and algorithms for sub-setting. An obvious place
> > to start is to use a blocksize-sensitive copy size. Recent versions of
> > nccopy use such an algorithm, I believe. However, this would require a
> > significant code refactoring for NCO, which is not currently funded.
> > However, NASA may fund implementation of groups in NCO. More on that
> > in coming weeks. Maybe those funds can leverage some of this work.
> > 6. Having written this much I'd like to hear from others before
> > blabbing-on. I wasn't aware there was any penalty for LBFs, so credit
> > goes to Gary for reporting the dramatic slow-downs on GLADE.
> > Any good ideas for methods to speed up netCDF3 writes on LBFs?
> > Are these performance penalties for LBFs better understood by others?
> >
> > Charlie
> >
> > Output of selected commands (extraneous stuff deleted):
> >
> > # Copying 3 GB takes ~1 minute with AIX on GLADE
> > zender@be1005en:~$ time /bin/cp /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc ~/gary.nc
> > real    1m3.219s
> >
> > # Copying 3 GB takes ~30 seconds with Linux on GLADE, twice as fast as AIX
> > zender@mirage0:~$ time /bin/cp /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc ~/gary.nc
> > real    0m30.812s
> >
> > # Test case takes ~8 minutes with ncks 3.9.6 on AIX
> > zender@be1005en:~$ /usr/local/bin/ncks --lbr
> > Linked to netCDF library version "3.6.2", compiled Apr  3 2007 14:19:36
> > zender@be1005en:~$ /usr/local/bin/ncks --vrs
> > NCO netCDF Operators version "3.9.6" last modified 2009/01/21 built Jan 28 2009 on be1105en by ddvento
> > zender@be1005en:~$ time /usr/local/bin/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_blf_3.9.6.nc
> > real    8m9.658s
> >
> > # Test case takes ~8 minutes with ncks 4.0.9 on AIX
> > zender@be1005en:~$ /glade/home/zender/bin/AIX/ncks --lbr
> > Linked to netCDF library version 4.1.3, compiled Aug 25 2011 08:32:40
> > zender@be1005en:~$ /glade/home/zender/bin/AIX/ncks --vrs
> > NCO netCDF Operators version 20120103 built Jan  3 2012 on be1005en.ucar.edu by zender
> > zender@be1005en:~$ time /glade/home/zender/bin/AIX/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_blf_4.0.9.nc
> > real    7m48.197s
> >
> > # netCDF4 improves speed relative to netCDF3 by a factor of ~2.9 on AIX
> > zender@be1005en:~$ time /glade/home/zender/bin/AIX/ncks -O -4 -D 3 -x -v TH ~/gary.nc ~/out4_blf_4.0.9.nc
> > real    2m42.123s
> >
> > # Test case takes ~4 minutes with ncks 4.0.9 on Linux
> > zender@mirage0:~$ /glade/home/zender/bin/LINUXAMD64/ncks/ncks --lbr
> > Linked to netCDF library version 4.1.3, compiled Jul 26 2011 15:05:13
> > zender@mirage0:~$ /glade/home/zender/bin/LINUXAMD64/ncks/ncks --vrs
> > NCO netCDF Operators version 20120103 built Jan 3 2012 on mirage0 by zender
> > zender@mirage0:~$ time /glade/home/zender/bin/LINUXAMD64/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_mrg_4.0.9.nc
> > real 4m15.493s
> >
> > # netCDF4 improves speed relative to netCDF3 by factor of ~2.5 on Linux
> > zender@mirage0:~/nco$ time /glade/home/zender/bin/LINUXAMD64/ncks -O -4 -D 3 -x -v TH ~/gary.nc ~/out4_mrg_4.0.9.nc
> > real 1m44.345s
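
The speed ratios implied by the `real` timings above can be recomputed with a short Python sketch. The timing strings are copied from the output; the dictionary layout and helper name are just for illustration.

```python
import re


def seconds(real):
    """Convert a shell `time` value such as '7m48.197s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError("unrecognized time format: %r" % real)
    return int(match.group(1)) * 60 + float(match.group(2))


# ncks 4.0.9 test case timings, keyed by (platform, output format).
timings = {
    ("AIX", "netCDF3"): "7m48.197s",
    ("AIX", "netCDF4"): "2m42.123s",
    ("Linux", "netCDF3"): "4m15.493s",
    ("Linux", "netCDF4"): "1m44.345s",
}

ratios = {}
for platform in ("AIX", "Linux"):
    nc3 = seconds(timings[(platform, "netCDF3")])
    nc4 = seconds(timings[(platform, "netCDF4")])
    ratios[platform] = nc3 / nc4
    print("%s: netCDF3 takes %.2f times as long as netCDF4" % (platform, ratios[platform]))

# Platform comparison for the netCDF3 case.
aix_vs_linux = seconds(timings[("AIX", "netCDF3")]) / seconds(timings[("Linux", "netCDF3")])
print("AIX vs Linux (netCDF3): %.2f" % aix_vs_linux)
```

These are single runs, so the ratios are indicative only.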
> > --
> > Charlie Zender, Department of Earth System Science
> > University of California, Irvine 949-891-2429  )'(
> >
> >
> 
> Russ Rew                                         UCAR Unidata Program
> address@hidden                      http://www.unidata.ucar.edu
> 
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: JZW-805384
Department: Support netCDF
Priority: Critical
Status: Closed