
NetCDF-4 AR-4 Timeseries Reads and Cache Sizes

Faster time series for the people!

What HDF5 chunk cache sizes are good for reading timeseries data in netCDF-4? I'm sure you have wondered - I know I have. Now we know: 0.5 to 4 MB. Bigger caches actually slow these reads down. Now that came as a surprise!

The first three numbers are the chunk sizes for the 3 dimensions of the main data variable. The next two columns show the deflate level (0 = none) and shuffle filter setting (0 = off). These are the same for every run, because the same input file is used for all these runs - only the chunk cache size is changed when (re-)opening the file. The Unix file cache is cleared between each run.

The two times shown are the number of microseconds taken by the first time-series read, and the average time per time-series read after all time series have been read.

*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...
cs[0] cs[1] cs[2] cache(MB) deflate shuffle 1st_read_ser(us) avg_read_ser(us)
256   128   128   0.5       0       0       1279615          2589
256   128   128   1.0       0       0       1279613          2641
256   128   128   4.0       0       0       1298543          2789
256   128   128   16.0      0       0       1470297          34603
256   128   128   32.0      0       0       1470360          34541

Note that for cache sizes up to 4 MB, the first time-series read took 1.28 - 1.30 s, and the average read took 0.0026 - 0.0028 s. But when I increased the chunk cache to 16 MB and 32 MB, the time for the first read went to about 1.5 s, and the average time for all reads went to about 0.035 s - an order of magnitude jump!

I have repeated these tests a number of times, always with this result for chunk cache buffers 16 MB and above.
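A possibly relevant bit of arithmetic: if the variable holds 4-byte floats (my assumption; the data type isn't stated above), each 256 x 128 x 128 chunk is exactly 16 MiB uncompressed, so the slowdown begins at precisely the cache size that first fits a whole chunk. A minimal sketch:

```python
# Bytes in one uncompressed chunk, assuming 4-byte float values
# (deflate is 0 in every run above, so chunks are stored uncompressed).
chunk_bytes = 256 * 128 * 128 * 4
print(chunk_bytes / 2**20)  # 16.0 MiB per chunk

# Whole chunks each tested cache size can hold:
for cache_mb in (0.5, 1.0, 4.0, 16.0, 32.0):
    whole_chunks = int(cache_mb * 2**20) // chunk_bytes
    print(f"{cache_mb:>5} MB cache holds {whole_chunks} whole chunk(s)")
# The 0.5, 1, and 4 MB caches hold no complete chunk; 16 MB holds
# exactly one and 32 MB holds two -- the same sizes where reads slowed.
```

Whether that coincidence explains the slowdown I can't say, but it is suggestive.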

I am planning on changing the netCDF-4.1 default to 1 MB, which is the HDF5 default. (I guess we should have listened to the HDF5 team in the first place.)

What Cache Size Should be Used to Read AR-4/AR-5 3D Data?

A question that has puzzled the greatest minds of history...

The not-yet-checked-in script nc_test4/run_bm_cache.sh tests reading a sample 3D data file with different sized caches.

Because of a weird increase in time for horizontal reads at the 16 MB cache size, I re-ran the test twice more to make sure I got the same results. And I did. I have no explanation for why 16 MB works so poorly.

The current netCDF-4 default cache size is 4 MB (which does fine), but I note that the original HDF5 default of 1 MB does even better. Perhaps I should just leave this cache alone as a default choice, and give users the HDF5 settings...

bash-3.2$ ./run_bm_cache.sh
*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_time_ser(us)
256   128   128   0.5       0       0       1291104
256   128   128   1.0       0       0       1298621
256   128   128   4.0       0       0       1306983
256   128   128   16.0      0       0       1472738
256   128   128   32.0      0       0       1497533
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us)
256   128   128   0.5       0       0       2308
256   128   128   1.0       0       0       2291
256   128   128   4.0       0       0       2453
256   128   128   16.0      0       0       11609
256   128   128   32.0      0       0       2603

SUCCESS!!!

bash-3.2$ ./run_bm_cache.sh 
*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_time_ser(us)
256   128   128   0.5       0       0       1290340
256   128   128   1.0       0       0       1281898
256   128   128   4.0       0       0       1306885
256   128   128   16.0      0       0       1470175
256   128   128   32.0      0       0       1497529
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us)
256   128   128   0.5       0       0       2298
256   128   128   1.0       0       0       2292
256   128   128   4.0       0       0       2335
256   128   128   16.0      0       0       11572
256   128   128   32.0      0       0       1841

SUCCESS!!!

bash-3.2$ ./run_bm_cache.sh 
*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_time_ser(us)
256   128   128   0.5       0       0       1298650
256   128   128   1.0       0       0       1298636
256   128   128   4.0       0       0       1565326
256   128   128   16.0      0       0       1497482
256   128   128   32.0      0       0       1497529

cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us)
256   128   128   0.5       0       0       2303
256   128   128   1.0       0       0       2287
256   128   128   4.0       0       0       2280
256   128   128   16.0      0       0       11584
256   128   128   32.0      0       0       1830

SUCCESS!!!

NetCDF-4 Horizontal Data Read Performance with Cache Clearing

Here are my numbers for doing horizontal reads with different cache sizes.

The times are the time to read each horizontal slice, reading all of them in sequence.

I realize that reading just one horizontal slice would give different (much higher) times. The reason is that when I read the first horizontal level, the various caches along the way start filling up with the following levels, so when I read those I get very low times. Reading all of them like this allows the caching to work. Reading just one horizontal level and then stopping the program (to clear the cache) would produce the worst-case scenario for the caching.

But what should I be optimizing for? Reading all horizontal levels? Or just reading one level?
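One way to see why the access pattern matters so much is to count how many chunks each kind of read must visit. A sketch, assuming the pr_A1 variable is 1560 x 128 x 256 (time x lat x lon) - dimensions consistent with the chunk shapes tested below, but my assumption, not stated in the post:

```python
import math

# Assumed variable shape: 1560 time steps on a 128 x 256 grid.
NT, NY, NX = 1560, 128, 256

def chunks_touched(chunk_shape, read_shape):
    """Chunks intersected by an aligned hyperslab read."""
    return math.prod(math.ceil(r / c)
                     for r, c in zip(read_shape, chunk_shape))

chunk = (256, 128, 128)
ts = chunks_touched(chunk, (NT, 1, 1))    # full time series: 7 chunks
hor = chunks_touched(chunk, (1, NY, NX))  # one horizontal level: 2 chunks
```

With 256 x 128 x 128 chunks, a whole time series touches only 7 chunks, so once those chunks are in some cache along the way, the per-series average drops to a few milliseconds.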

cs[0]   cs[1]   cs[2]   cache(MB)       deflate shuffle read_hor(us)
0       0       0       0.0             0       0       1527
1       16      32      1.0             0       0       1577
1       16      128     1.0             0       0       1618
1       16      256     1.0             0       0       1515
1       64      32      1.0             0       0       1579
1       64      128     1.0             0       0       1586
1       64      256     1.0             0       0       1584
1       128     32      1.0             0       0       1593
1       128     128     1.0             0       0       1583
1       128     256     1.0             0       0       1571
10      16      32      1.0             0       0       2128
10      16      128     1.0             0       0       2520
10      16      256     1.0             0       0       4309
10      64      32      1.0             0       0       4083
10      64      128     1.0             0       0       1751
10      64      256     1.0             0       0       1713
10      128     32      1.0             0       0       1692
10      128     128     1.0             0       0       1862
10      128     256     1.0             0       0       1749
256     16      32      1.0             0       0       10594
256     16      128     1.0             0       0       3681
256     16      256     1.0             0       0       3074
256     64      32      1.0             0       0       3656
256     64      128     1.0             0       0       3042
256     64      256     1.0             0       0       2773
256     128     32      1.0             0       0       3828
256     128     128     1.0             0       0       2335
256     128     256     1.0             0       0       1581
1024    16      32      1.0             0       0       35622
1024    16      128     1.0             0       0       2759
1024    16      256     1.0             0       0       2912
1024    64      32      1.0             0       0       2875
1024    64      128     1.0             0       0       2868
1024    64      256     1.0             0       0       3816
1024    128     32      1.0             0       0       2780
1024    128     128     1.0             0       0       2558
1024    128     256     1.0             0       0       1628
1560    16      32      1.0             0       0       154450
1560    16      128     1.0             0       0       3063
1560    16      256     1.0             0       0       3700

NetCDF-4 Performance With Cache Clearing

Now I have made some changes in my timing program, and I think I am getting better (i.e. more realistic) times.

Firstly, I now clear the cache before each read.

Secondly, I don't try to read the horizontal sections and the timeseries in the same program run - whichever one is done first loads the cache for the other, and gives unrealistic times. Now I time these separately.

OK, so here are some timeseries read times. The first row is netCDF classic data:

cs[0]   cs[1]   cs[2]   cache(MB)       deflate shuffle read_time_ser(us)
1       16      32      1.0             0       0       2434393
1       16      128     1.0             0       0       2411127
1       16      256     1.0             0       0       2358892
1       64      32      1.0             0       0       2455963
1       64      128     1.0             0       0       2510818
1       64      256     1.0             0       0       2482509
1       128     32      1.0             0       0       2480481
1       128     128     1.0             0       0       2489436
1       128     256     1.0             0       0       2504924
10      16      32      1.0             0       0       1146593
10      16      128     1.0             0       0       1156650
10      16      256     1.0             0       0       1259026
10      64      32      1.0             0       0       1150427
10      64      128     1.0             0       0       2384334
10      64      256     1.0             0       0       2438587
10      128     32      1.0             0       0       1258380
10      128     128     1.0             0       0       2521213
10      128     256     1.0             0       0       2528927
256     16      32      1.0             0       0       174062
256     16      128     1.0             0       0       358613
256     16      256     1.0             0       0       404662
256     64      32      1.0             0       0       400489
256     64      128     1.0             0       0       688528
256     64      256     1.0             0       0       1267521
256     128     32      1.0             0       0       404422
256     128     128     1.0             0       0       1374661
256     128     256     1.0             0       0       2445647
1024    16      32      1.0             0       0       78718
1024    16      128     1.0             0       0       346506
1024    16      256     1.0             0       0       378813
1024    64      32      1.0             0       0       340703
1024    64      128     1.0             0       0       665649
1024    64      256     1.0             0       0       1269936
1024    128     32      1.0             0       0       380796
1024    128     128     1.0             0       0       1269627
1024    128     256     1.0             0       0       2513330
1560    16      32      1.0             0       0       58124
1560    16      128     1.0             0       0       332641
1560    16      256     1.0             0       0       372587
1560    64      32      1.0             0       0       323445
1560    64      128     1.0             0       0       635165
1560    64      256     1.0             0       0       1263225
1560    128     32      1.0             0       0       372226
1560    128     128     1.0             0       0       1265999
1560    128     256     1.0             0       0       2712887

These numbers make more sense. It takes about 2.3 seconds to read the time series from the classic file.

Ed

Demonstrating Caching and Its Effect on Timing

The cache can really mess up benchmarking!

For example:

bash-3.2$ sudo bash clear_cache.sh && ./tst_ar4_3d -h -c
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64    256   128   4.0       0       0       66           2102
bash-3.2$ sudo bash clear_cache.sh && ./tst_ar4_3d -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64    256   128   4.0       0       0       1859         2324282

In the first run of tst_ar4_3d, with the -c option, the sample data file is first created and then read. The read time for the time series read is really low, because the file (having just been created) is still loaded in a disk cache somewhere in the OS or in the disk hardware.

When I clear the cache and rerun without the -c option, the sample data file is not created, it is assumed to already exist. Since the cache has been cleared, the time series read has to read the data from disk, and it takes 1000 times longer.

Well, that's why they invented disk caches.
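The warm-cache effect is easy to reproduce with nothing but plain file I/O (a minimal sketch, not netCDF):

```python
import os
import tempfile
import time

def time_read(path, bufsize=1 << 20):
    """Time one full sequential read of path, in seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(bufsize):
            pass
    return time.perf_counter() - start

# Write an 8 MB scratch file, then read it twice.  Without something
# like clear_cache.sh between the reads, both are served largely from
# the OS page cache (the write itself already warms it) rather than
# from disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(8 << 20))
    path = f.name

first = time_read(path)
second = time_read(path)
os.remove(path)
print(f"first: {first:.6f} s, second: {second:.6f} s")
```

On a machine where the file cache really has been cleared (as root, the way clear_cache.sh is run above), the first read is dominated by disk I/O and the gap between the two times becomes dramatic, just like the -c / no -c runs.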

This leads me to believe that my horizontal read times are fake too, because first I am doing a time series read, thus loading some or all of the file into the cache. I need to break that out into a separate test, or perhaps make the order of the two tests controllable from the command line.

Oy, this benchmarking stuff is tricky business! I thought I had found some really good performance for netCDF-4, but now I am not sure. I need to look again more carefully and make sure that I am not being faked out by the caches.

Ed

Unidata Developer's Blog
A weblog about software development by Unidata developers*
