NetCDF-4 AR-4 Timeseries Reads and Cache Sizes
03 January 2010
Faster time series for the people!
What HDF5 chunk cache sizes are good for reading timeseries data
in netCDF-4? I'm sure you have wondered - I know I have. Now we know:
0.5 to 4 MB. Bigger caches just slow these reads down. That came as a
surprise!
The first three columns are the chunk sizes of the three dimensions of the
main data variable. The fourth column is the chunk cache size, the only
thing varied between runs. The next two columns show the deflate (0 = none)
and shuffle (0 = none) filter settings. The filter settings are the same
for every run, because the same input file is used throughout; only the
chunk cache size is changed when (re-)opening the file. The Unix file
cache is cleared between runs.
The two times shown are the number of microseconds to read the first
time series of the data, and the average time per time-series read once
all the time series have been read.
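The pattern for each run is to set the HDF5 chunk cache size before
(re-)opening the file. Here is a minimal sketch of that pattern using the
netCDF-4.1 API (the variable name "pr" and the time dimension length are
my assumptions about the sample file; the slot count and preemption are
just typical values, not what the benchmark necessarily uses):

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define CHECK(e) do { int s = (e); if (s != NC_NOERR) { \
    fprintf(stderr, "%s\n", nc_strerror(s)); exit(1); } } while (0)

int
main()
{
    int ncid, varid;
    size_t start[3] = {0, 0, 0};
    size_t count[3] = {1560, 1, 1};   /* one full time series (length assumed) */
    static float ts[1560];

    /* The cache settings must be set before nc_open to take effect
     * for this file: 1 MB of cache, 1009 slots, 0.75 preemption. */
    CHECK(nc_set_chunk_cache(1024 * 1024, 1009, 0.75f));
    CHECK(nc_open("pr_A1_256_128_128.nc", NC_NOWRITE, &ncid));
    CHECK(nc_inq_varid(ncid, "pr", &varid));
    CHECK(nc_get_vara_float(ncid, varid, start, count, ts));
    CHECK(nc_close(ncid));
    return 0;
}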
*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...
cs[0] cs[1] cs[2] cache(MB) deflate shuffle 1st_read_ser(us) avg_read_ser(us)
256 128 128 0.5 0 0 1279615 2589
256 128 128 1.0 0 0 1279613 2641
256 128 128 4.0 0 0 1298543 2789
256 128 128 16.0 0 0 1470297 34603
256 128 128 32.0 0 0 1470360 34541
Note that for cache sizes of 4 MB and below, the first time-series read
took 1.2 - 1.3 s, and the average time was 0.0026 - 0.0028 s. But when I
increased the chunk cache to 16 MB and 32 MB, the time for the first read
went to about 1.5 s, and the average time for all reads went to about
0.035 s - an order of magnitude jump!
I have repeated these tests a number of times, always with the same result for chunk cache sizes of 16 MB and above.
I am planning on changing the netCDF-4.1 default to 1 MB, which is the
HDF5 default. (I guess we should have listened to the HDF5 team in the
first place.)
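Whatever the default ends up being, a program that knows its own access
pattern can still override the cache for a particular variable after the
file is open. A minimal sketch using the netCDF-4.1 per-variable call
(the values shown are illustrative):

#include <netcdf.h>

/* Give one variable a 1 MB chunk cache: 1 MB of space, 1009 slots,
 * and the typical 0.75 preemption. Returns a netCDF error code. */
int
use_small_cache(int ncid, int varid)
{
    return nc_set_var_chunk_cache(ncid, varid, 1024 * 1024, 1009, 0.75f);
}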
What Cache Size Should be Used to Read AR-4/AR-5 3D Data?
03 January 2010
A question that has puzzled the greatest minds of history...
The not-yet-checked-in script nc_test4/run_bm_cache.sh tests reading a sample 3D data file with caches of different sizes.
Because of a weird increase in time for horizontal reads at the 16 MB
cache size, I re-ran the test twice more to make sure I got the same
results. And I did. I have no explanation for why 16 MB performs so poorly.
The current netCDF-4 default cache size is 4 MB (which does fine), but I
note that the original HDF5 default of 1 MB does even better. Perhaps I
should just leave the default alone and give users the HDF5 settings...
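The microsecond numbers in the output below come from wall-clock timing
around each read. A simplified sketch of that kind of measurement (not
the actual benchmark code; the time dimension length is assumed):

#include <sys/time.h>
#include <netcdf.h>

/* Return the wall-clock microseconds taken to read the full time
 * series at one (lat, lon) point. The time dimension length (1560)
 * is an assumption about the sample file. */
long long
time_ser_read_us(int ncid, int varid, size_t lat, size_t lon, float *buf)
{
    struct timeval t1, t2;
    size_t start[3] = {0, lat, lon};
    size_t count[3] = {1560, 1, 1};

    gettimeofday(&t1, NULL);
    nc_get_vara_float(ncid, varid, start, count, buf);
    gettimeofday(&t2, NULL);
    return (t2.tv_sec - t1.tv_sec) * 1000000LL +
           (t2.tv_usec - t1.tv_usec);
}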
bash-3.2$ ./run_bm_cache.sh
*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_time_ser(us)
256 128 128 0.5 0 0 1291104
256 128 128 1.0 0 0 1298621
256 128 128 4.0 0 0 1306983
256 128 128 16.0 0 0 1472738
256 128 128 32.0 0 0 1497533
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us)
256 128 128 0.5 0 0 2308
256 128 128 1.0 0 0 2291
256 128 128 4.0 0 0 2453
256 128 128 16.0 0 0 11609
256 128 128 32.0 0 0 2603
SUCCESS!!!
bash-3.2$ ./run_bm_cache.sh
*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_time_ser(us)
256 128 128 0.5 0 0 1290340
256 128 128 1.0 0 0 1281898
256 128 128 4.0 0 0 1306885
256 128 128 16.0 0 0 1470175
256 128 128 32.0 0 0 1497529
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us)
256 128 128 0.5 0 0 2298
256 128 128 1.0 0 0 2292
256 128 128 4.0 0 0 2335
256 128 128 16.0 0 0 11572
256 128 128 32.0 0 0 1841
SUCCESS!!!
bash-3.2$ ./run_bm_cache.sh
*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_time_ser(us)
256 128 128 0.5 0 0 1298650
256 128 128 1.0 0 0 1298636
256 128 128 4.0 0 0 1565326
256 128 128 16.0 0 0 1497482
256 128 128 32.0 0 0 1497529
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us)
256 128 128 0.5 0 0 2303
256 128 128 1.0 0 0 2287
256 128 128 4.0 0 0 2280
256 128 128 16.0 0 0 11584
256 128 128 32.0 0 0 1830
SUCCESS!!!
NetCDF-4 Horizontal Data Read Performance with Cache Clearing
03 January 2010
Here are my numbers for doing horizontal reads with different cache sizes.
The times are the time to read each horizontal slice, reading all of them in sequence.
I realize that reading just one horizontal slice would give different
(much higher) times. When I read the first horizontal level, the various
caches along the way start filling up with the following levels, so the
later levels read very quickly. Reading this way allows the caching to
work. Reading just one horizontal level and then stopping the program
(so the cache is cleared) would be the worst-case scenario for the caching.
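For reference, the horizontal access pattern being timed looks something
like this sketch (the dimension lengths are my inference from the chunk
sizes in the table below, not confirmed values):

#include <netcdf.h>

#define NREC 1560   /* assumed number of time steps */
#define NLAT 128    /* assumed latitude dimension length */
#define NLON 256    /* assumed longitude dimension length */

/* Read every horizontal level in sequence. Chunks pulled into the
 * cache by one level can serve later levels, so per-level times
 * stay low when reading them all. */
int
read_all_levels(int ncid, int varid, float *buf)
{
    size_t start[3] = {0, 0, 0};
    size_t count[3] = {1, NLAT, NLON};
    int ret = NC_NOERR;

    for (start[0] = 0; start[0] < NREC && ret == NC_NOERR; start[0]++)
        ret = nc_get_vara_float(ncid, varid, start, count, buf);
    return ret;
}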
But what should I be optimizing for? Reading all horizontal levels? Or just reading one level?
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us)
0 0 0 0.0 0 0 1527
1 16 32 1.0 0 0 1577
1 16 128 1.0 0 0 1618
1 16 256 1.0 0 0 1515
1 64 32 1.0 0 0 1579
1 64 128 1.0 0 0 1586
1 64 256 1.0 0 0 1584
1 128 32 1.0 0 0 1593
1 128 128 1.0 0 0 1583
1 128 256 1.0 0 0 1571
10 16 32 1.0 0 0 2128
10 16 128 1.0 0 0 2520
10 16 256 1.0 0 0 4309
10 64 32 1.0 0 0 4083
10 64 128 1.0 0 0 1751
10 64 256 1.0 0 0 1713
10 128 32 1.0 0 0 1692
10 128 128 1.0 0 0 1862
10 128 256 1.0 0 0 1749
256 16 32 1.0 0 0 10594
256 16 128 1.0 0 0 3681
256 16 256 1.0 0 0 3074
256 64 32 1.0 0 0 3656
256 64 128 1.0 0 0 3042
256 64 256 1.0 0 0 2773
256 128 32 1.0 0 0 3828
256 128 128 1.0 0 0 2335
256 128 256 1.0 0 0 1581
1024 16 32 1.0 0 0 35622
1024 16 128 1.0 0 0 2759
1024 16 256 1.0 0 0 2912
1024 64 32 1.0 0 0 2875
1024 64 128 1.0 0 0 2868
1024 64 256 1.0 0 0 3816
1024 128 32 1.0 0 0 2780
1024 128 128 1.0 0 0 2558
1024 128 256 1.0 0 0 1628
1560 16 32 1.0 0 0 154450
1560 16 128 1.0 0 0 3063
1560 16 256 1.0 0 0 3700
NetCDF-4 Performance With Cache Clearing
03 January 2010
Now I have made some changes in my timing program, and I think I am getting better (i.e. more realistic) times.
Firstly, I now clear the cache before each read.
Secondly, I don't try to read the horizontal sections and the
time series in the same program run - whichever one is done first loads
the cache for the other and gives unrealistic times. Now I time these
separately.
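In outline, the separation looks like this (a sketch only, not my actual
benchmark; the flag name and dimension lengths are illustrative):

#include <string.h>
#include <netcdf.h>

#define NREC 1560   /* assumed dimension lengths for the sample file */
#define NLAT 128
#define NLON 256

static float hor_buf[NLAT * NLON];
static float ts_buf[NREC];

int
main(int argc, char **argv)
{
    int ncid, varid;
    size_t start[3] = {0, 0, 0};
    size_t count[3];

    if (nc_open("pr_A1_256_128_128.nc", NC_NOWRITE, &ncid) ||
        nc_inq_varid(ncid, "pr", &varid))
        return 1;

    if (argc > 1 && !strcmp(argv[1], "-h")) {
        /* Horizontal test only: one level, all of lat x lon. */
        count[0] = 1; count[1] = NLAT; count[2] = NLON;
        nc_get_vara_float(ncid, varid, start, count, hor_buf);
    } else {
        /* Time-series test only: all times at one lat/lon point. */
        count[0] = NREC; count[1] = 1; count[2] = 1;
        nc_get_vara_float(ncid, varid, start, count, ts_buf);
    }
    return nc_close(ncid);
}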
OK, so here are some time-series read times. The first row is netCDF classic data:
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_time_ser(us)
1 16 32 1.0 0 0 2434393
1 16 128 1.0 0 0 2411127
1 16 256 1.0 0 0 2358892
1 64 32 1.0 0 0 2455963
1 64 128 1.0 0 0 2510818
1 64 256 1.0 0 0 2482509
1 128 32 1.0 0 0 2480481
1 128 128 1.0 0 0 2489436
1 128 256 1.0 0 0 2504924
10 16 32 1.0 0 0 1146593
10 16 128 1.0 0 0 1156650
10 16 256 1.0 0 0 1259026
10 64 32 1.0 0 0 1150427
10 64 128 1.0 0 0 2384334
10 64 256 1.0 0 0 2438587
10 128 32 1.0 0 0 1258380
10 128 128 1.0 0 0 2521213
10 128 256 1.0 0 0 2528927
256 16 32 1.0 0 0 174062
256 16 128 1.0 0 0 358613
256 16 256 1.0 0 0 404662
256 64 32 1.0 0 0 400489
256 64 128 1.0 0 0 688528
256 64 256 1.0 0 0 1267521
256 128 32 1.0 0 0 404422
256 128 128 1.0 0 0 1374661
256 128 256 1.0 0 0 2445647
1024 16 32 1.0 0 0 78718
1024 16 128 1.0 0 0 346506
1024 16 256 1.0 0 0 378813
1024 64 32 1.0 0 0 340703
1024 64 128 1.0 0 0 665649
1024 64 256 1.0 0 0 1269936
1024 128 32 1.0 0 0 380796
1024 128 128 1.0 0 0 1269627
1024 128 256 1.0 0 0 2513330
1560 16 32 1.0 0 0 58124
1560 16 128 1.0 0 0 332641
1560 16 256 1.0 0 0 372587
1560 64 32 1.0 0 0 323445
1560 64 128 1.0 0 0 635165
1560 64 256 1.0 0 0 1263225
1560 128 32 1.0 0 0 372226
1560 128 128 1.0 0 0 1265999
1560 128 256 1.0 0 0 2712887
These numbers make more sense. It takes about 2.3 seconds to read the time series from the classic file.
Ed
Demonstrating Caching and Its Effect on Timing
02 January 2010
The cache can really mess up benchmarking!
For example:
bash-3.2$ sudo bash clear_cache.sh && ./tst_ar4_3d -h -c
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64 256 128 4.0 0 0 66 2102
bash-3.2$ sudo bash clear_cache.sh && ./tst_ar4_3d -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64 256 128 4.0 0 0 1859 2324282
In the first run of tst_ar4_3d, with the -c option, the sample data file
is first created and then read. The time for the time-series read is
really low, because the file (having just been created) is still loaded
in a disk cache somewhere in the OS or in the disk hardware.
When I clear the cache and rerun without the -c option, the sample data
file is not created; it is assumed to already exist. Since the cache has
been cleared, the time-series read has to get the data from disk, and
it takes about 1000 times longer.
Well, that's why they invented disk caches.
This leads me to believe that my horizontal read times are fake too,
because first I am doing a time-series read, thus loading some or all
of the file into cache. I need to break that out into a separate test, I
see, or perhaps make the order of the two tests controllable from the
command line.
Oy, this benchmarking stuff is tricky business! I thought I had found
some really good performance for netCDF-4, but now I am not sure. I need
to look again more carefully and make sure that I am not being faked
out by the caches.
Ed